Fine-Tuning Qwen2-VL-7B for Nutrition Table Detection

Fine-tune Qwen2-VL, a vision-language model (VLM), to detect nutrition tables in product images, starting from a zero-shot baseline and ending with LoRA-based experiments.

Project Abstract: This notebook documents the end-to-end process of fine-tuning the Qwen2-VL-7B model for nutrition table detection. Starting from a strong zero-shot baseline (0.590 Mean IoU), I systematically explored three QLoRA fine-tuning strategies, overcoming significant memory and hardware challenges. The best model achieved a Mean IoU of 0.771, a 30.7% relative improvement, demonstrating the effectiveness of parameter-efficient fine-tuning for specialized vision-language tasks.

πŸ“‹ Table of ContentsΒΆ

  1. Introduction & Motivation
  2. Environment & Setup
  3. Dataset Overview & Visualization
  4. Understanding the Qwen2-VL Model
  5. Zero-Shot Baseline Evaluation
  6. Fine-Tuning Strategy and Data Preparation
  7. Rationale for Parameter-Efficient Fine-Tuning (PEFT)
  8. Fine-Tuning Experiments and Training
  9. Checkpoint Evaluation
  10. Final Results and Analysis
  11. Production Deployment: Merging LoRA Adapters

Introduction & Motivation

In this notebook, I fine-tune Qwen2-VL-7B to detect nutrition tables in product images, using a dataset hosted on Hugging Face.

If you are new to this kind of work, a good starting point is A Hands-On Guide to Fine-Tuning Large Language Models with PyTorch and Hugging Face by Daniel Voigt Godoy.

Environment & Setup

Environment

  • Hardware: NVIDIA A100 (80 GB)
  • Install dependencies: pip install -r requirements.txt
  • Register the environment as a Jupyter/IPython kernel (e.g., python -m ipykernel install --user --name qwen2vl)
  • Hugging Face access: huggingface-cli login or set HUGGINGFACE_HUB_TOKEN
  • Note: on Colab, uncomment and run the !pip install cell below
In [1]:
  #!pip install torch>=2.1 torchvision>=0.16 torchaudio>=2.1 \
  #              numpy pillow datasets>=2.20 transformers>=4.42 \
  #              accelerate>=0.27 trl>=0.9 peft>=0.12 safetensors>=0.4 \
  #              huggingface_hub>=0.23 tqdm matplotlib \
  #              bitsandbytes>=0.43 \
  #              qwen-vl-utils \
  #              seaborn
In [2]:
# Standard library
import os
import re
import json
from pathlib import Path
from pprint import pprint
import time
import gc

# Third-party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import ImageDraw, Image, ImageFont

import torch
from torch.utils.data import DataLoader
from torchvision.ops import box_iou

from datasets import load_dataset
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
)
from peft import LoraConfig, PeftModel, get_peft_model
from trl import SFTConfig, SFTTrainer
from huggingface_hub import login
from tqdm.auto import tqdm

import bitsandbytes as bnb
from qwen_vl_utils import process_vision_info, vision_process
# import wandb
/workspace/conda/envs/qwen2vl/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(
/workspace/conda/envs/qwen2vl/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(
In [3]:
print("Torch:", torch.__version__, "| CUDA build:", torch.version.cuda, "| CUDA avail:", torch.cuda.is_available())
if torch.cuda.is_available(): print("GPU:", torch.cuda.get_device_name(0))
Torch: 2.4.1+cu121 | CUDA build: 12.1 | CUDA avail: True
GPU: NVIDIA A100 80GB PCIe

Optional Settings for an Improved Jupyter Experience

In [29]:
from IPython.display import display, HTML, set_matplotlib_formats
display(HTML("<style>.container { width:100% !important; }</style>"))
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%config InlineBackend.figure_format = 'retina'
# Show figures only when plt.show() is called explicitly
%matplotlib inline
plt.ioff()
Out[29]:
<contextlib.ExitStack at 0x70cb0241cb20>
In [5]:
login(token="YOUR_HF_TOKEN_HERE")
# !pip install hf_transfer
# login(token=os.environ["HUGGINGFACE_TOKEN"])
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

Utility functions

In [6]:
def clear_memory():
    """Delete large globals (model, trainer, ...) and release GPU memory."""
    for name in ("inputs", "model", "processor", "trainer", "peft_model", "bnb_config"):
        if name in globals():
            del globals()[name]
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
In [7]:
def parse_bounding_boxes(response_text: str) -> list:
    """
    Parses a model's text response to find bounding box coordinates.
    - Flexibly finds all numbers (int or float) in the text.
    - Groups them into bounding boxes of 4.
    - Converts them from the Qwen 0-1000 scale to a normalized [0, 1] scale.
    - Returns a list of lists, with each inner list being [x_min, y_min, x_max, y_max].
    - It intelligently sorts the coordinates to ensure (x_min, y_min) is the top-left corner.
    """
    all_numbers_str = re.findall(r'[-+]?\d*\.\d+|\d+', response_text)
    if len(all_numbers_str) < 4:
        return []

    all_numbers = [float(n) for n in all_numbers_str]
    num_boxes = len(all_numbers) // 4

    parsed_boxes = []
    for i in range(num_boxes):
        start_index = i * 4
        box_nums = all_numbers[start_index : start_index + 4]
        c1, c2, c3, c4 = box_nums
        x1, y1, x2, y2 = c1 / 1000.0, c2 / 1000.0, c3 / 1000.0, c4 / 1000.0

        x_min = min(x1, x2)
        y_min = min(y1, y2)
        x_max = max(x1, x2)
        y_max = max(y1, y2)

        parsed_boxes.append([x_min, y_min, x_max, y_max])

    return parsed_boxes

# Optional test function and test cases
def run_parser_test_suite():
    """
    Tests the parse_bounding_boxes function against various text inputs.
    The expected format is a list of lists: [[x_min, y_min, x_max, y_max], ...],
    with all coordinates normalized to the [0, 1] range.
    """
    test_cases = {
        "official_four_coords": {
            "input": "I found two boxes. The first is at 0,12,0,35. The second is 0,67,0,85.",
            "expected": [[0.0, 0.012, 0.0, 0.035], [0.0, 0.067, 0.0, 0.085]]
        },
        "official_two_pairs": {
            "input": "The box is at (10, 20), (300, 400)",
            "expected": [[0.01, 0.02, 0.3, 0.4]]
        },
        "three_boxes_two_pairs": {
            "input": "(1,2),(3,4) and (5,6),(7,8) also (9,10),(11,12)",
            "expected": [[0.001, 0.002, 0.003, 0.004], [0.005, 0.006, 0.007, 0.008], [0.009, 0.01, 0.011, 0.012]]
        },
        "brackets_float": {
            "input": "[0.1, 0.2, 0.3, 0.4]",
            "expected": [[0.0001, 0.0002, 0.0003, 0.0004]]  # one box; the parser assumes the 0-1000 scale
        },
        "conversational_text": {
            "input": "I think the nutrition table is around 150, 200, 550, 750 on the label.",
            "expected": [[0.15, 0.2, 0.55, 0.75]]
        },
        "desc w/ (int, int..)": {
            "input": "bounding_box: (0, 0, 1000, 1000)",
            "expected": [[0.0, 0.0, 1.0, 1.0]]
        },
        "no_box": {
            "input": "There is no nutrition table in this image.",
            "expected": []
        }
    }

    print("--- Running Final Parser Test Suite ---")
    all_passed = True
    for name, case in test_cases.items():
        result = parse_bounding_boxes(case["input"])
        is_correct = True
        if len(result) != len(case["expected"]):
            is_correct = False
        else:
            for res_box, exp_box in zip(result, case["expected"]):
                if not all(abs(r - e) < 1e-6 for r, e in zip(res_box, exp_box)):
                    is_correct = False
                    break

        if is_correct:
            print(f"βœ… PASSED: {name}")
        else:
            print(f"❌ FAILED: {name} | Got: {result}, Expected: {case['expected']}")
            all_passed = False

    if all_passed:
        print("\nπŸŽ‰ All tests passed!")


# run_parser_test_suite()
In [8]:
SYSTEM_MESSAGE = (
  "You are a vision-language model specializing in nutrition-table detection.\n"
  "Detect every nutrition table in the image and respond only with lines of the form:\n"
  "nutrition-table<box(x_min, y_min),(x_max, y_max)>\n"
  "Coordinates are integers between 0 and 1000 in a normalized coordinate system (x first, then y).\n"
  "If multiple tables exist, return each on a separate line. Do not extract or describe text."
)

USER_PROMPT = "Detect all nutrition tables in this image and return the boxes."


def run_inference(example_or_image, *, model=None, processor=None, prompt=None, max_new_tokens=128):
  """
  Runs inference on a dataset example (raw or mapped) or a raw PIL image.
  """
  mdl = model if model is not None else globals().get("model")
  proc = processor if processor is not None else globals().get("processor")
  if mdl is None or proc is None:
      raise ValueError("Pass `model`/`processor`, or keep globals with those names available.")

  # Handle raw dicts with or without 'messages'
  if isinstance(example_or_image, dict):
      if "messages" in example_or_image:
          messages = example_or_image["messages"]
          image = example_or_image["image"]
      else:
          image = example_or_image["image"]
          messages = [
              {"role": "system", "content": SYSTEM_MESSAGE},
              {
                  "role": "user",
                  "content": [
                      {"type": "image", "image": image},
                      {"type": "text", "text": prompt or USER_PROMPT},
                  ],
              },
          ]
  else:
      image = example_or_image
      messages = [
          {"role": "system", "content": SYSTEM_MESSAGE},
          {
              "role": "user",
              "content": [
                  {"type": "image", "image": image},
                  {"type": "text", "text": prompt or USER_PROMPT},
              ],
          },
      ]

  # Format messages
  formatted_messages = []
  for message in messages:
      role = message.get("role")
      content = message.get("content")

      if isinstance(content, list) and content and isinstance(content[0], dict) and "type" in content[0]:
          formatted_messages.append(message)
          continue

      text = content if isinstance(content, str) else ""
      if role == "user":
          formatted_messages.append({
              "role": "user",
              "content": [
                  {"type": "image", "image": image},
                  {"type": "text", "text": text.replace("<|image_1|>", "").strip()},
              ],
          })
      else:
          formatted_messages.append({
              "role": role,
              "content": [{"type": "text", "text": text}],
          })

  text_prompt = proc.tokenizer.apply_chat_template(
      formatted_messages,
      tokenize=False,
      add_generation_prompt=True,
  )

  inputs = proc(
      text=text_prompt,
      images=[image],
      return_tensors="pt",
      padding=True,
  ).to(mdl.device)

  if "pixel_values" in inputs:
      inputs["pixel_values"] = inputs["pixel_values"].to(mdl.dtype)

  with torch.no_grad():
      generated_ids = mdl.generate(
          **inputs,
          max_new_tokens=max_new_tokens,
          do_sample=False,
          num_beams=1,
      )

  trimmed_generated_ids = [
      out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
  ]

  # Decode only the NEW tokens
  response = proc.batch_decode(trimmed_generated_ids, skip_special_tokens=True)[0]

  return response

Dataset Loading & Exploration

In this section, we load and explore the openfoodfacts/nutrition-table-detection dataset. It contains product images, extracted barcodes, and bounding boxes for the nutrition tables.

In [9]:
# load the dataset into training and evaluation sets
dataset_id = "openfoodfacts/nutrition-table-detection"
dataset_train_raw = load_dataset(dataset_id, split="train")
dataset_test_raw = load_dataset(dataset_id, split="val")
In [10]:
example = dataset_train_raw[657]
pprint(example)
# raw image (scaled down)
_ = plt.figure(figsize=(6,6))
_ = plt.imshow(example["image"])
_ = plt.axis("off")
_ = plt.show()
{'height': 3053,
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2866x3053 at 0x70CB0241E2C0>,
 'image_id': '3257983357752_2',
 'meta': {'barcode': '3257983357752',
          'image_url': 'https://static.openfoodfacts.org/images/products/325/798/335/7752/2.jpg',
          'off_image_id': '2'},
 'objects': {'bbox': [[0.6387160420417786,
                       0.22644801437854767,
                       0.8336063027381897,
                       0.5094208121299744],
                      [0.5224369764328003,
                       0.8314724564552307,
                       0.6226662397384644,
                       0.8967201709747314],
                      [0.521781861782074,
                       0.8963712453842163,
                       0.6233212947845459,
                       0.9637125134468079]],
             'category_id': [0, 2, 2],
             'category_name': ['nutrition-table',
                               'nutrition-table-small-energy',
                               'nutrition-table-small-energy']},
 'width': 2866}

Dataset Overview & Visualization: Nutrition Table Detection

This project utilizes the openfoodfacts/nutrition-table-detection dataset, which is available on Hugging Face. The dataset was created by Open Food Facts and was used to train their own production model for detecting nutrition tables, providing a robust, real-world foundation for this fine-tuning task.

For our purposes, we will focus on the following key fields from each sample:

  • image: The input image loaded as a PIL object.
  • width & height: The original dimensions of the image in pixels. These are essential for visualizing the bounding boxes.
  • objects: A dictionary containing the ground-truth annotations for the image.
    • bbox: A list containing the bounding box coordinates.
    • category_name: A list containing the object's class name, the main one being 'nutrition-table'.

Normalized Bounding Box Coordinates

The bounding box coordinates are normalized, meaning their values range from 0 to 1. The coordinates are provided in the format [y_min, x_min, y_max, x_max].

This is a standard practice in computer vision because it makes the model's training process independent of the input image's resolution. To properly visualize these normalized coordinates on an image, we must scale them back to pixel values using the image's original width and height:

  • absolute_x = normalized_x * image_width
  • absolute_y = normalized_y * image_height
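As a quick sanity check, this scaling can be wrapped in a small helper (an illustrative sketch, not part of the notebook's pipeline; it follows the dataset's [y_min, x_min, y_max, x_max] layout described above):

```python
def denormalize_bbox(bbox, image_width, image_height):
    """Convert a normalized [y_min, x_min, y_max, x_max] box (this dataset's
    layout) into absolute pixel coordinates (x_min, y_min, x_max, y_max)."""
    y_min, x_min, y_max, x_max = bbox
    return (int(x_min * image_width), int(y_min * image_height),
            int(x_max * image_width), int(y_max * image_height))

# A box on a 2000x1000 image
print(denormalize_bbox([0.25, 0.10, 0.75, 0.50], 2000, 1000))  # (200, 250, 1000, 750)
```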
In [11]:
def show_bboxes(example, show_labels=True, figsize=(8, 8)):
    """
    Show all bounding boxes for a single HF example dict from
    openfoodfacts/nutrition-table-detection.

    Args:
        example: dict with keys ["image", "objects", "image_id", ...]
        show_labels: draw 1..n in top-left inside each box
        figsize: matplotlib figure size
    """
    img = example["image"].copy()
    w, h = img.size
    draw = ImageDraw.Draw(img)

    # scale line width & font for visibility on big images
    lw = max(2, h // 400)
    fs = max(18, h // 30)
    try:
        font = ImageFont.truetype("DejaVuSans-Bold.ttf", fs)
    except Exception:
        font = ImageFont.load_default()

    for i, bb in enumerate(example["objects"]["bbox"], start=1):
        # dataset format: [y_min, x_min, y_max, x_max] normalized
        y_min, x_min, y_max, x_max = map(float, bb)
        x0, y0 = int(x_min * w), int(y_min * h)
        x1, y1 = int(x_max * w), int(y_max * h)

        draw.rectangle([x0, y0, x1, y1], outline="red", width=lw)
        if show_labels:
            draw.text((x0 + 5, y0 + 5), str(i), fill="red", font=font)

    plt.figure(figsize=figsize)
    plt.imshow(img)
    title = f"Image ID: {example.get('image_id', 'unknown')} β€’ {len(example['objects']['bbox'])} boxes"
    plt.title(title)
    plt.axis("off")
    plt.show()

show_bboxes(example)

Analysis of Data Distributions

After visualizing the dataset, I drew several key conclusions that directly influenced my modeling and memory management strategy.

Key Observations & Implications

  • Variable Image Resolutions: The histograms show a wide distribution of image widths and heights, with no single standard size. While the Qwen2-VL architecture is designed to handle variable resolutions by breaking images into patches, this variation presents a significant memory challenge. A very large image can result in a long sequence of visual tokens, drastically increasing the VRAM required for even a single sample (batch_size=1). This observation validated my decision to implement a MAX_PIXELS limit as a crucial memory optimization technique.

  • Small Bounding Boxes: The bounding boxes for nutrition tables are typically small relative to the overall image dimensions. This suggests that the model needs to be effective at identifying small features within a larger context.

  • Handling Multiple Detections: While most images in this dataset contain a single nutrition table, a robust evaluation plan must account for cases with multiple ground-truth boxes or multiple model predictions. My approach for calculating the Mean IoU will be to match each predicted box to the ground-truth box that has the highest overlap. This ensures a fair evaluation, even in complex scenarios.
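The MAX_PIXELS idea mentioned above amounts to an aspect-preserving downscale. Here is a hypothetical standalone sketch (the value of MAX_PIXELS is illustrative, not the one used in training):

```python
from PIL import Image

MAX_PIXELS = 1024 * 1024  # illustrative cap; tune to your VRAM budget

def cap_pixels(img: Image.Image, max_pixels: int = MAX_PIXELS) -> Image.Image:
    """Downscale an image so that width * height <= max_pixels,
    preserving the aspect ratio. Smaller images pass through untouched."""
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = (max_pixels / (w * h)) ** 0.5
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

big = Image.new("RGB", (4000, 3000))
print(cap_pixels(big).size)  # a bit under the cap, same aspect ratio
```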

In [12]:
### get the histogram of the image sizes

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def build_image_stats(ds, split_name):
  widths, heights, bbox_counts, unique_categories, categories = [], [], [], [], []

  for row in ds:
      w, h = row["image"].size
      widths.append(w)
      heights.append(h)

      names = row["objects"].get("category_name") or ["unknown"]
      bbox_counts.append(len(names))
      unique_categories.append(len(set(names)))
      categories.append(", ".join(names))

  return pd.DataFrame({
      "width": widths,
      "height": heights,
      "bbox_count": bbox_counts,
      "unique_categories": unique_categories,
      "category": categories,
      "split": split_name,
  })

df_train = build_image_stats(dataset_train_raw, "train")
df_eval = build_image_stats(dataset_test_raw, "eval")
stats_df = pd.concat([df_train, df_eval], axis=0)

sns.set_theme(style="whitegrid")
fig, axes = plt.subplots(1, 3, figsize=(18, 4))
_ = sns.histplot(data=stats_df, x="width", hue="split", stat="density", ax=axes[0], bins=30)
_ = axes[0].set_title("Image Width")
_ = sns.histplot(data=stats_df, x="height", hue="split", stat="density", ax=axes[1], bins=30)
_ = axes[1].set_title("Image Height")
# sns.histplot(data=stats_df, x="bbox_count", hue="split", discrete=True, ax=axes[2])
_ = sns.histplot(data=stats_df, x="bbox_count", hue="split", ax=axes[2])
_ = axes[2].set_title("# Bounding Boxes per Image")
fig.tight_layout()
In [13]:
_ = plt.figure(figsize=(8,4))
# sns.countplot(data=stats_df, x="unique_categories", hue="split", discrete=True)
_ = sns.countplot(data=stats_df, x="unique_categories", hue="split")
_ = plt.title("Unique Categories per Image")
plt.show()

_ = plt.figure(figsize=(10,4))
_ = sns.countplot(data=stats_df, x="category", order=stats_df["category"].value_counts().index)
_ = plt.xticks(rotation=45, ha="right")
_ = plt.title("Category Frequency")
plt.tight_layout()
plt.show()
/tmp/ipykernel_696/1450487219.py:11: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all Axes decorations.
  plt.tight_layout()

Understanding the Qwen2-VL Model

Before using the model, it's important to understand its core components and data requirements.

  • Architecture: The model pairs a Vision Encoder, which converts image patches into visual tokens, with a Large Language Model (LLM). The visual tokens are projected into the LLM's embedding space and processed together with the text tokens by the LLM's self-attention, which is how the LLM "sees" the visual information. The vision encoder uses 2D Rotary Position Embeddings (RoPE) to effectively capture the spatial relationships between image patches.

  • The Processor: The Hugging Face processor is a critical utility that bundles all necessary preprocessing. It applies a chat template to structure the conversation, tokenizes the text, and performs "patch-ification" to convert images into a sequence of visual tokens.

  • Expected Bounding Box Format: A key detail from the official Qwen-VL paper is that the model expects bounding box coordinates to be scaled to an integer grid of 1000x1000. My data preparation pipeline handles the conversion from the dataset's normalized [0, 1] coordinates into the required format: nutrition-table<box(x1, y1),(x2, y2)>.
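That conversion from normalized [y_min, x_min, y_max, x_max] boxes to the 0-1000 target string can be sketched as follows (an illustrative helper; the actual data-preparation pipeline comes later in the notebook):

```python
def to_qwen_box(bbox, label="nutrition-table"):
    """Format a normalized [y_min, x_min, y_max, x_max] box as a Qwen-style
    detection line on the 0-1000 integer grid (x first, then y)."""
    y_min, x_min, y_max, x_max = bbox
    x1, y1 = round(x_min * 1000), round(y_min * 1000)
    x2, y2 = round(x_max * 1000), round(y_max * 1000)
    return f"{label}<box({x1}, {y1}),({x2}, {y2})>"

print(to_qwen_box([0.25, 0.10, 0.75, 0.60]))
# nutrition-table<box(100, 250),(600, 750)>
```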

The Processor: A Unified Preprocessing Pipeline

The Hugging Face processor for Qwen2-VL is a critical utility that bundles all necessary preprocessing steps. It's more than just a tokenizer; it's a complete data preparation tool.

  1. Chat Template Application: The process begins with the chat template. When given a conversational input (e.g., a user prompt with text and images), the processor's apply_chat_template function formats it into a single, structured string. It inserts control tokens like <|im_start|>user to manage turns and vision placeholder tokens (<|vision_start|>…<|vision_end|>) where each image belongs.

  2. Vision Processing: For each image, the processor calls an internal function similar to process_vision_info. This function performs several key operations:

    • It resizes and normalizes the image to the expected dimensions and pixel value range.
    • It performs "patch-ification," dicing the image into a sequence of smaller, fixed-size patches. These patches are the visual equivalent of text tokens.
    • The final output is a pixel_values tensor, ready for the Vision Encoder.
  3. Text Tokenization: The formatted prompt string (with image placeholders) is passed to the text tokenizer, which converts it into numerical input_ids.

By handling these steps, the processor outputs a dictionary containing the input_ids, pixel_values, and attention_mask needed to feed the model.

Model Architecture and Forward Pass

The Qwen2-VL architecture is designed to fuse these two modalities:

  • The Vision Encoder, a Transformer-based network, processes the image patches to extract high-level visual features.
  • A projection layer (the "merger") maps these features into the LLM's embedding space as visual tokens.
  • The LLM then runs ordinary self-attention over the combined sequence of visual and text tokens, so each generated token can "look at" the relevant parts of the image.

For a prompt with multiple images, such as "Describe the first image. Now look at the second and compare.", each image contributes its own span of visual tokens, delimited by the vision placeholder tokens in the prompt, so the model can keep the images apart while generating text about either one.

Positional Awareness: 2D RoPE

A key innovation in modern Transformers, including Qwen2-VL's vision encoder, is the use of 2D Rotary Position Embedding (RoPE).

  • What is it? Traditional position embeddings add a vector to each token to give it a sense of its absolute location (e.g., "this is patch #5"). RoPE, however, is a more elegant solution that rotates each patch's embedding vector by an angle proportional to its (x, y) coordinates.

  • Why is it better? This rotational method inherently encodes the relative positions between patches directly into the self-attention calculation. The model doesn't just know where a patch is; it has a built-in, efficient way to understand how far apart patch A is from patch B, both horizontally and vertically. This is crucial for vision tasks, as it helps the model understand the spatial relationships that form objects and scenes without needing extra learnable parameters for position.
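To make the rotation idea concrete, here is a toy numpy sketch (not Qwen2-VL's actual implementation): each pair of embedding dimensions is rotated by an angle proportional to the patch's x or y coordinate. Rotations preserve the vector's norm, and attention scores between two rotated vectors depend only on their relative offset.

```python
import numpy as np

def rope_2d(vec, x, y, base=10000.0):
    """Toy 2D RoPE: rotate consecutive pairs of `vec` by angles derived from
    the patch's (x, y) grid position. The first half of the pairs encodes x,
    the second half encodes y."""
    d = vec.shape[-1]
    assert d % 4 == 0
    out = vec.astype(np.float64).copy()
    half = d // 2  # components per axis
    for axis, pos in ((0, x), (1, y)):
        offset = axis * half
        for i in range(half // 2):
            theta = pos / (base ** (2 * i / half))
            c, s = np.cos(theta), np.sin(theta)
            a, b = out[offset + 2 * i], out[offset + 2 * i + 1]
            out[offset + 2 * i] = a * c - b * s
            out[offset + 2 * i + 1] = a * s + b * c
    return out

v = np.ones(8)
print(np.allclose(np.linalg.norm(rope_2d(v, 3, 5)), np.linalg.norm(v)))  # True
```

Because each pair is rotated by pos * w_i, the dot product between two rotated vectors is invariant to shifting both positions by the same offset, which is exactly the relative-position property described above.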

In [19]:
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
 load_in_4bit=True,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_compute_dtype=torch.bfloat16,
 bnb_4bit_use_double_quant=True,
)

model = AutoModelForImageTextToText.from_pretrained(
 "Qwen/Qwen2-VL-7B-Instruct",
 quantization_config=bnb_config,
 device_map="auto",
 trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Baseline Model Memory Footprint

Loading the base Qwen2-VL-7B model with the 4-bit NF4 quantization config above reveals its resource needs.

  • Parameters (reported 4.69B): bitsandbytes packs two 4-bit weights into each stored element, so the reported count is roughly half the model's true parameter count, and the helper's bfloat16-based size estimate (~8.74 GB) is only indicative.
  • CUDA Allocated (5.53 GB): the active memory holding the quantized weights.
  • CUDA Reserved (7.32 GB): the total memory pool PyTorch has claimed from the GPU for current and future operations (like activations during inference).

Even quantized, the model occupies ~7 GB before any activations, KV cache, gradients, or optimizer state. Full fine-tuning in 16-bit would need roughly twice the weight memory again for gradients and several times more for Adam optimizer states, making parameter-efficient techniques like LoRA essential.
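These weight-memory figures follow from simple arithmetic on bytes per parameter (weights only, ignoring activations, KV cache, and optimizer state):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough size of model weights alone at a given precision."""
    return n_params * bytes_per_param / 1024**3

# ~7B parameters at different precisions (weights only)
print(f"bf16 : {weight_memory_gb(7e9, 2):.1f} GB")   # ≈13.0 GB
print(f"int8 : {weight_memory_gb(7e9, 1):.1f} GB")   # ≈6.5 GB
print(f"nf4  : {weight_memory_gb(7e9, 0.5):.1f} GB") # ≈3.3 GB
```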

In [13]:
def print_model_memory(model):
  total_params = sum(p.numel() for p in model.parameters())
  trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
  total_gb = total_params * 2 / 1024**3  # rough estimate assuming 2-byte (bf16) weights; quantized models pack smaller
  print(f"Parameters: {total_params:,} (~{total_gb:.2f} GB)")
  print(f"Trainable parameters: {trainable_params:,}")

  if torch.cuda.is_available():
      print(f"CUDA memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
      print(f"CUDA memory reserved:  {torch.cuda.memory_reserved()/1024**3:.2f} GB")
print_model_memory(model)
Parameters: 4,691,876,352 (~8.74 GB)
Trainable parameters: 1,091,870,720
CUDA memory allocated: 5.53 GB
CUDA memory reserved:  7.32 GB
In [30]:
def evaluate_vlm(model, processor, dataset, max_samples=None, iou_threshold=0.5, max_new_tokens=128):
  """
  Evaluates a vision-language model on object detection.
  
  Calculates:
  1. True Mean IoU: Average of best IoU for each GT box (no threshold)
     - Each GT box is matched to its best prediction
     - Unmatched GT boxes contribute 0
     - This is the TRUE mean across all GT boxes
  
  2. Threshold-based metrics (precision, recall, F1):
     - Uses iou_threshold for counting TP/FP/FN
     - Greedy matching above threshold
  
  Args:
      model: VLM model
      processor: Model processor
      dataset: Test dataset (list or HF dataset)
      max_samples: Optional limit on samples
      iou_threshold: Threshold for precision/recall/F1 (NOT used for mean IoU)
      max_new_tokens: Max tokens for generation
  
  Returns:
      dict with mean_gt_iou, precision, recall, f1, samples_evaluated
  """
  model.eval()
  total_iou_sum = 0.0
  total_gt_boxes = 0
  tp, fp, fn = 0, 0, 0

  if max_samples:
      # HF datasets must be subset with .select(); slicing them returns a column dict
      samples = dataset.select(range(max_samples)) if hasattr(dataset, "select") else dataset[:max_samples]
  else:
      samples = dataset

  for example in samples:
      response = run_inference(
          example,
          model=model,
          processor=processor,
          max_new_tokens=max_new_tokens
      )
      pred_boxes = parse_bounding_boxes(response)
      gt_boxes = example["objects"]["bbox"]

      # Increment total ground truth boxes
      total_gt_boxes += len(gt_boxes)

      if not pred_boxes or not gt_boxes:
          if not pred_boxes:
              fn += len(gt_boxes)  # Missed all GT boxes
          if not gt_boxes:
              fp += len(pred_boxes)  # All predictions are false positives
          continue

      pred_tensor = torch.tensor(pred_boxes, dtype=torch.float32)
      gt_tensor = torch.tensor(gt_boxes, dtype=torch.float32)[:, [1, 0, 3, 2]]

      iou_matrix = box_iou(pred_tensor, gt_tensor)  # [num_pred, num_gt]

      # --- 1. True Mean IoU Calculation (No Threshold) ---
      # For each GT box, find the IoU of its best-matching prediction.
      # If a GT box has no match, its best IoU is 0.
      if iou_matrix.numel() > 0:
          best_ious_for_gt, _ = iou_matrix.max(dim=0)  # Best pred for each GT
          total_iou_sum += best_ious_for_gt.sum().item()
      # else: no predictions, all GTs contribute 0 (already counted in total_gt_boxes)

      # --- 2. Precision/Recall/F1 Calculation (With Threshold) ---
      # Use greedy matching to find true positives above threshold
      all_pairs = sorted(
          [(iou_matrix[p, g].item(), p, g)
           for p in range(iou_matrix.shape[0])
           for g in range(iou_matrix.shape[1])],
          reverse=True
      )

      matched_preds = set()
      matched_gts = set()

      for iou, p, g in all_pairs:
          if iou < iou_threshold:  # ← Threshold ONLY affects TP/FP/FN
              break
          if p in matched_preds or g in matched_gts:
              continue
          matched_preds.add(p)
          matched_gts.add(g)

      tp += len(matched_preds)
      fp += len(pred_boxes) - len(matched_preds)
      fn += len(gt_boxes) - len(matched_preds)

  # Final calculations
  mean_iou = total_iou_sum / total_gt_boxes if total_gt_boxes else 0.0
  precision = tp / (tp + fp) if (tp + fp) else 0.0
  recall = tp / (tp + fn) if (tp + fn) else 0.0
  f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

  return {
      'mean_gt_iou': mean_iou,
      f'precision@{iou_threshold:.2f}': precision,
      f'recall@{iou_threshold:.2f}': recall,
      f'f1@{iou_threshold:.2f}': f1,
      'samples_evaluated': len(samples),
  }

Zero-Shot Baseline Evaluation

My initial tests with a simple prompt confirmed the model's default behavior is to perform Optical Character Recognition (OCR). To get a true detection baseline, I had to engineer a more effective prompt to override this behavior.

Crafting the Final Prompt

The final prompt was designed to be highly explicit, aligning with the model's training data:

  1. It defines the task ("Detect every nutrition table...").
  2. It specifies the exact output format ("nutrition-table<box(x_min, y_min),(x_max, y_max)>") and coordinate system (integers from 0 to 1000 in a normalized coordinate system).
  3. It includes a negative constraint to prevent OCR ("Do not extract or describe text.").

Final Baseline Results

Using this engineered prompt, I ran the evaluation on the entire test set of 123 samples to get the final, official baseline metrics (see the cell output below):

  • Mean IoU: 0.433
  • F1-Score (@0.50 IoU): 0.593
  • Precision (@0.50 IoU): 0.610
  • Recall (@0.50 IoU): 0.577

This proves that while the model can be guided to understand the task, it lacks the specialized ability to perform it accurately, justifying the need for fine-tuning. These zero-shot numbers also serve as the numerical benchmark against which every fine-tuning experiment below is measured.

InΒ [15]:
baseline_metrics = evaluate_vlm(model, processor, dataset_test_raw, max_samples=None, iou_threshold=0.5)
print(baseline_metrics)

# sanity checks below
# run_inference(example)
# from itertools import islice

# for idx, example in enumerate(islice(dataset_test_raw, 10)):
#   response = run_inference(example, max_new_tokens=256)
#   pred_boxes = parse_bounding_boxes(response)
#   gt_boxes = example["objects"]["bbox"]

#   print(f"\nSample {idx}")
#   print("Raw response:")
#   print(response)
#   print("Decoded predicted boxes:", pred_boxes)
#   print("Ground-truth boxes:", gt_boxes)
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
{'mean_gt_iou': 0.43347253386790935, 'precision@0.50': 0.6097560975609756, 'recall@0.50': 0.5769230769230769, 'f1@0.50': 0.5928853754940712, 'samples_evaluated': 123}
InΒ [16]:
# def iou_debug(model, processor, dataset, num_samples=5):
#   samples = islice(dataset, num_samples)
#   for i, example in enumerate(samples):
#       response = run_inference(example, max_new_tokens=256)
#       preds = parse_bounding_boxes(response)
#       gts = example["objects"]["bbox"]

#       if preds:
#           gt = torch.tensor(gts, dtype=torch.float32)[:, [1,0,3,2]]
#           pr = torch.tensor(preds, dtype=torch.float32)[:, [1,0,3,2]]
#           ious = box_iou(gt, pr).max(dim=1).values.tolist()
#       else:
#           ious = [0.0] * len(gts)
#       print(f"Sample {i} IoUs:", ious)

# iou_debug(model, processor, dataset_test_raw, num_samples=5)

Qualitative Analysis of Baseline PerformanceΒΆ

To provide a visual understanding of the baseline performance, I overlaid the model's predicted bounding box (in red) on top of the ground-truth box (in green) for a sample image.

As shown, while the model correctly identifies the general region of the nutrition table, it lacks the precision needed for a practical application. The low IoU score for this sample visually corresponds to the significant misalignment between the two boxes. This qualitative result reinforces the need for fine-tuning to improve the model's localization accuracy.

InΒ [10]:
def visualize_prediction(example, response, title="Prediction vs. Ground Truth"):
  image = example["image"].copy()
  draw = ImageDraw.Draw(image)
  w, h = image.size

  # Ground truth boxes come as [ymin, xmin, ymax, xmax]
  for y_min, x_min, y_max, x_max in example["objects"]["bbox"]:
      draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="lime",
          width=4,
      )

  # Predictions from parse_bounding_boxes are [x_min, y_min, x_max, y_max]
  for x_min, y_min, x_max, y_max in parse_bounding_boxes(response):
      draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="red",
          width=4,
      )

  plt.figure(figsize=(8, 8))
  plt.imshow(image)
  plt.title(title)
  plt.axis("off")
  plt.show()



# sample = dataset_test_raw[0]
sample = dataset_train_raw[657]
response = run_inference(sample, max_new_tokens=256)
visualize_prediction(sample, response)

Fine-Tuning Strategy and Data PreparationΒΆ

With a clear baseline established, the next step is to fine-tune the model to improve its accuracy. This section outlines my strategy for training and the data preparation required.

Training Objective vs. Evaluation MetricΒΆ

A key decision in this project is to separate the training objective from the evaluation metric.

  • Training Objective (Cross-Entropy Loss): The model is trained to minimize cross-entropy loss, which measures the accuracy of token-by-token text prediction. It is a differentiable function, which is essential for backpropagation.
    • Limitation: It is strict on syntax. The model is penalized for any textual deviation from the ground truth, even if the meaning (i.e., the bounding box coordinates) is identical.
  • Evaluation Metric (Mean IoU): To measure true task success, I use Mean IoU, which calculates the geometric overlap between the predicted and ground-truth boxes. It is a direct measure of geometric accuracy.

My approach is to train with cross-entropy loss but select the best checkpoint based on the highest Mean IoU on the validation set. This aligns the final model with the true task goal and helps monitor for overfitting.
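A toy example makes the gap between the two signals concrete: the strings below differ only in whitespace, so a token-level objective like cross-entropy penalizes the mismatch, while the underlying geometry is identical:

```python
import re

reference  = "nutrition_label<box(250,300),(450,500)>"
prediction = "nutrition_label<box(250, 300),(450, 500)>"  # extra spaces only

# A token/character-level objective sees two different targets...
assert reference != prediction

# ...but once the coordinates are parsed out, the boxes are the same,
# so a geometric metric like IoU would score this prediction as perfect.
coords = lambda s: [int(v) for v in re.findall(r"\d+", s)]
assert coords(reference) == coords(prediction) == [250, 300, 450, 500]
print("same box, different text")
```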

Fine-Tuning ExperimentsΒΆ

I will explore two LoRA targeting strategies (plus a loss-masking variant) to determine the most effective fine-tuning approach:

  1. Language-Only LoRA: Adapts only the LLM to better interpret the visual features.
  2. Vision+Language LoRA: Adapts both the vision encoder and the LLM to adapt and refine the visual features themselves.

Final Training Sample StructureΒΆ

The code below shows the final data structure that will be fed into the trainer. It combines the image, the engineered prompt, and the ground-truth assistant response with coordinates scaled to the required 1000x1000 format.

[
  {
    "role": "user",
    "content": [
      { "type": "image", "image_url": "path/to/image.jpg" },
      { "type": "text", "text": "Detect all nutrition label regions in this image. Respond with their bounding boxes using the format \"nutrition_label<box(x_min, y_min),(x_max, y_max)>\" on a 1000x1000 canvas. If there are multiple labels, return all of them on separate lines. Do not extract or describe any text β€” only detect and localize the label areas." }
    ]
  },
  {
    "role": "assistant",
    "content": "nutrition-table<box(250, 300),(450, 500)>" # Example scaled coordinates
  }
]
InΒ [14]:
# Reset GPU memory before (re)loading the base model + LoRA adapters
clear_memory()
GPU allocated memory: 0.00 GB
GPU reserved memory: 0.00 GB
InΒ [15]:
!nvidia-smi
Fri Oct 10 14:54:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:41:00.0 Off |                    0 |
| N/A   34C    P0             66W /  300W |     423MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Rationale for Parameter-Efficient Fine-Tuning (PEFT)ΒΆ

Fine-tuning all 7 billion parameters of the Qwen2-VL model is not only impractical from a hardware perspective but also often suboptimal for performance. It risks catastrophic forgetting, where the model loses its powerful, general-purpose abilities, and can quickly overfit to a small dataset.

Instead, I'm using Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA). This allows me to adapt the model by training less than 0.1% of its total parameters, preserving its core knowledge while teaching it our specific task.

Why Full Fine-Tuning is Infeasible on an A100 40GB GPUΒΆ

A quick calculation demonstrates the memory constraints. For a 7-billion-parameter model, a full fine-tuning process requires VRAM for more than just the model weights:

  • Model Weights (16-bit): 7B params Γ— 2 bytes/param β‰ˆ 14 GB
  • Gradients (16-bit): 7B params Γ— 2 bytes/param β‰ˆ 14 GB
  • Optimizer States (AdamW): 7B params Γ— 8 bytes/param (for 32-bit moments) β‰ˆ 56 GB

The total, ~84 GB, exceeds the capacity of an A100 GPU (whether 40 GB or 80 GB) before even accounting for activation memory, which in a VLM is dominated by the image inputs. This makes full fine-tuning impossible on this hardware.
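The arithmetic above can be reproduced directly (using decimal GB and the AdamW layout described: bf16 weights and gradients plus two fp32 optimizer moments per parameter):

```python
params = 7e9  # 7B parameters

weights_gb   = params * 2 / 1e9   # bf16 weights: 2 bytes/param
gradients_gb = params * 2 / 1e9   # bf16 gradients: 2 bytes/param
optimizer_gb = params * 8 / 1e9   # AdamW: two fp32 moments, 4 bytes each

total_gb = weights_gb + gradients_gb + optimizer_gb
print(weights_gb, gradients_gb, optimizer_gb, total_gb)  # 14.0 14.0 56.0 84.0
```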

My Multi-Faceted Memory Optimization StrategyΒΆ

To solve this, I implemented a multi-faceted strategy where each component addresses a different part of the memory problem:

  1. LoRA & Quantization: This is the primary solution. By freezing the quantized base weights and training only small LoRA adapters with an 8-bit paged optimizer (paged_adamw_8bit), I drastically reduce the memory needed for gradients and optimizer states from >70 GB to just a few hundred megabytes.
  2. MAX_PIXELS Image Resizing: This addresses the activation memory. Even with LoRA, processing very high-resolution images can create large activation maps that cause out-of-memory (OOM) errors. By setting a maximum pixel count, I ensure that the memory required for the forward and backward passes remains within the GPU's limits, even for a batch_size=1.
  3. Gradient Checkpointing & Accumulation: These techniques are the final polish. Gradient checkpointing trades compute time for memory, and accumulating gradients over 4 steps allows me to simulate a larger, more stable batch size of 4 without the associated memory cost.

Together, these techniques address each of the main VRAM consumers in VLM training: optimizer state, activation maps, and effective batch size.
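Gradient accumulation is easy to verify in isolation: scaling each per-microbatch loss by 1/accum_steps makes the accumulated gradients match those of a single full batch. The sketch below demonstrates this on a toy linear model (not the VLM):

```python
import torch

torch.manual_seed(0)
model_a = torch.nn.Linear(4, 1)
model_b = torch.nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

x = torch.randn(4, 4)   # one "full batch" of 4 samples
y = torch.randn(4, 1)
loss_fn = torch.nn.MSELoss()

# (a) One backward pass over the full batch of 4
loss_fn(model_a(x), y).backward()

# (b) Four microbatches of 1, each loss scaled by 1/accum_steps;
#     gradients accumulate in .grad between backward() calls
accum_steps = 4
for i in range(accum_steps):
    xb, yb = x[i:i + 1], y[i:i + 1]
    (loss_fn(model_b(xb), yb) / accum_steps).backward()

# The accumulated gradients match the full-batch gradients
assert torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6)
print("gradients match")
```

This is why `per_device_train_batch_size=1` with `gradient_accumulation_steps=4` behaves like a batch of 4 for the optimizer, at the memory cost of a batch of 1.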


Pre-processing Strategy: Handling Variable Image ResolutionsΒΆ

My analysis of the dataset revealed a wide distribution of image dimensions, with a long tail of very high-resolution images.

These large outlier images can cause out-of-memory (OOM) errors during the initial data loading phase (dataset.map()), even before the trainer's optimizations are applied.

To solve this, I've implemented a two-stage resizing strategy:

  1. Pre-emptive Resizing (Safety Net): Inside my create_chat_format function, I first cap the size of every image so that its longest side does not exceed 1024 pixels. I chose 1024 as a balance between preserving visual detail for the model to learn from and staying small enough to avoid OOM errors on the A100 during data preparation.
  2. Final Resizing (MAX_PIXELS): After this initial safety check, the trainer's vision_processor takes over and applies the final MAX_PIXELS = 470,400 constraint. This ensures every image fed into a training batch has a consistent memory footprint.

This approach allows me to retain valuable detail from larger images while guaranteeing that the training process remains stable and within my VRAM budget.

InΒ [16]:
DOWNSIZE = True

def create_chat_format(sample):
  """
  Converts a sample from the OpenFoodFacts dataset to the Qwen2-VL chat format.
  *** This version correctly normalizes bounding box coordinates to a 0-1000 scale. ***
  """
  assistant_response = ""
  objects = sample["objects"]

  if DOWNSIZE:
      max_long_side = 1024
      img = sample["image"].copy()
      img.thumbnail((max_long_side, max_long_side), Image.Resampling.LANCZOS)
      sample["image"] = img

  for i in range(len(objects["bbox"])):
      category = objects["category_name"][i]
      box = objects["bbox"][i]

      y_min_norm, x_min_norm, y_max_norm, x_max_norm = box

      x_min = int(x_min_norm * 1000)
      y_min = int(y_min_norm * 1000)
      x_max = int(x_max_norm * 1000)
      y_max = int(y_max_norm * 1000)

      assistant_response += (
          f"<|object_ref_start|>{category}<|object_ref_end|>"
          f"<|box_start|>({x_min},{y_min}),({x_max},{y_max})<|box_end|> "
      )

  messages = [
      {"role": "system", "content": SYSTEM_MESSAGE},
      {
          "role": "user",
          "content": [
              {"type": "image", "image": sample["image"]},
              {"type": "text", "text": USER_PROMPT},
          ],
      },
      {"role": "assistant", "content": assistant_response.strip()},
  ]

  return {"image": sample["image"], "messages": messages}


print("Formatting training dataset...")
train_dataset = [create_chat_format(sample) for sample in dataset_train_raw]

print("Formatting evaluation dataset...")
eval_dataset = [create_chat_format(sample) for sample in dataset_test_raw]

print(f"βœ… Datasets formatted: {len(train_dataset)} train, {len(eval_dataset)} eval")
Formatting training dataset...
Formatting evaluation dataset...
βœ… Datasets formatted: 1083 train, 123 eval
InΒ [16]:
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_math_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)

print('βœ… Flash Attention kernels enabled (flash_sdp).')
βœ… Flash Attention kernels enabled (flash_sdp).
InΒ [17]:
# ----------------------------------------------------------------------------------
# CRITICAL MEMORY FIX: Set MAX_PIXELS to constrain activation memory
# ----------------------------------------------------------------------------------
# The Qwen2-VL processor converts each image into a grid of patches. The total
# number of patches is determined by the image's resolution. Without a cap,
# high-resolution images can create an extremely large number of patches,
# leading to out-of-memory errors from the activation maps during the forward pass.
#
# By setting MAX_PIXELS, we cap the total size of the feature map, which is the
# primary lever for controlling VRAM usage from image data. This provides a
# massive memory saving (~8-9 GB) compared to using original resolutions.
#
# A value of 470,400 (600 * 28 * 28) was chosen as a conservative but effective
# setting for the A100 GPU.
# ----------------------------------------------------------------------------------

from qwen_vl_utils import process_vision_info, vision_process
import torch

vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"βœ… MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels to manage VRAM.")

# Verify MAX_PIXELS is set
print(f"MAX_PIXELS: {vision_process.MAX_PIXELS:,}")
βœ… MAX_PIXELS set to: 470,400 pixels to manage VRAM.
MAX_PIXELS: 470,400

Fine-Tuning Experiments and TrainingΒΆ

Now I'll prepare the model for fine-tuning. This involves loading the model with 4-bit quantization to manage memory and then applying the LoRA configuration.

Deconstructing the QLoRA ConfigurationΒΆ

The BitsAndBytesConfig is the core of QLoRA. Here's what the key choices mean:

  • load_in_4bit=True: This instructs the library to load the large, frozen base model with its weights quantized to 4-bits, which is the primary source of memory savings.
  • bnb_4bit_quant_type="nf4": I use the "NormalFloat 4-bit" (NF4) data type because it's specifically designed for the bell-curve distribution of neural network weights, offering better precision than standard 4-bit floats.
  • bnb_4bit_compute_dtype=torch.bfloat16: This is a critical performance setting. It tells the model to de-quantize the 4-bit weights to 16-bit bfloat16 for the actual matrix multiplications. GPUs have specialized hardware (Tensor Cores) optimized for 16-bit math, which provides a massive speedup.
InΒ [55]:
clear_memory()
model_id = "Qwen/Qwen2-VL-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

print("βœ… Vision-Language model and processor loaded successfully!")
GPU allocated memory: 5.55 GB
GPU reserved memory: 18.56 GB
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
βœ… Vision-Language model and processor loaded successfully!

Debugging an Out-of-Memory Error During EvaluationΒΆ

During my initial training run, I encountered an out-of-memory (OOM) error at the end of the first epoch, specifically when the validation step began.

  • Problem Diagnosis: The training itself was memory-stable, but during evaluation, the model would sometimes fail to generate an end-of-sequence token and produce an extremely long, unconstrained output. When the trainer tried to pad all validation predictions to match the length of this single long output, it attempted to allocate a massive tensor (~31 GB), causing the OOM crash.
  • The Solution: To fix this, I created a GenerationConfig object to explicitly control the generation behavior during the evaluation phase. By setting max_new_tokens=128, I provide a generous limit for the model to generate its short bounding box response, while preventing the runaway generation that caused the memory spike.

This configuration is passed to the SFTTrainer to ensure all mid-training evaluations are memory-safe.

InΒ [56]:
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=128,  # or 256 if you prefer
    do_sample=False,
    num_beams=1,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
)

model.generation_config = generation_config  # make it the default
# print(hasattr(model, "peft_config"))
InΒ [20]:
class VLMDataCollator:
  """
  Collate function for Qwen2-VL fine-tuning.

  - Converts a mapped dataset example (with `messages` and `image`) into the
    multimodal ChatML structure that Qwen expects: the user turn contains both the
    image and the prompt text, and assistant turns carry plain text.
  - Uses the Qwen processor to tokenize text and encode images, returning padded
    batches with `input_ids`, `pixel_values`, and other multimodal features.
  - Optionally masks the prompt tokens in `labels` (via `mask_prompt=True`) so that
    the loss is computed only on the assistant’s answer. This lets you switch between
    completion loss and full-text loss without redefining the collator.
  """

  def __init__(self, processor, mask_prompt=True):
      self.processor = processor
      self.mask_prompt = mask_prompt
      self.pad_id = processor.tokenizer.pad_token_id

  def _to_multimodal_chat(self, conversation, image):
      formatted = []
      for message in conversation:
          role = message.get('role')
          content = message.get('content')

          if isinstance(content, list) and content and isinstance(content[0], dict) and 'type' in content[0]:
              formatted.append(message)
              continue

          text = content if isinstance(content, str) else ''
          if role == 'user':
              formatted.append({
                  'role': 'user',
                  'content': [
                      {'type': 'image', 'image': image},
                      {'type': 'text', 'text': text.replace('<|image_1|>', '').strip()},
                  ],
              })
          else:
              formatted.append({
                  'role': role,
                  'content': [{'type': 'text', 'text': text}],
              })
      return formatted

  def __call__(self, features):
      processed_conversations = []
      prompts = []
      image_inputs = []

      for feature in features:
          conversation = feature['messages']
          image = feature['image']

          multimodal = self._to_multimodal_chat(conversation, image)
          processed_conversations.append(multimodal)

          prompts.append(
              self.processor.apply_chat_template(
                  multimodal, tokenize=False, add_generation_prompt=False
              )
          )

          image_inputs.append(process_vision_info(multimodal)[0])

      batch = self.processor(
          text=prompts,
          images=image_inputs,
          return_tensors='pt',
          padding=True,
      )

      batch['pixel_values'] = batch['pixel_values'].to(torch.bfloat16)

      labels = batch['input_ids'].clone()
      for idx, conversation in enumerate(processed_conversations):
          prompt_only = conversation[:-1]
          if not prompt_only:
              continue
          prompt_text = self.processor.apply_chat_template(
              prompt_only, tokenize=False, add_generation_prompt=True
          )
          prompt_ids = self.processor.tokenizer(
              prompt_text,
              add_special_tokens=False,
              return_attention_mask=False,
          ).input_ids
          if self.mask_prompt:
              labels[idx, : len(prompt_ids)] = -100

      if self.pad_id is not None:
          labels[batch['input_ids'] == self.pad_id] = -100

      batch['labels'] = labels
      return batch
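The label-masking logic in the collator can be illustrated on a toy token sequence: prompt and padding positions are set to -100 (the index PyTorch's cross-entropy ignores), so loss flows only through the assistant's answer tokens:

```python
# Toy illustration of completion-only loss masking, as done in VLMDataCollator.
# Token ids are made up; pad id is 0 here.
input_ids = [101, 7, 8, 9, 42, 43, 44, 0, 0]   # prompt + answer + padding
prompt_len = 4
pad_id = 0

labels = list(input_ids)
labels[:prompt_len] = [-100] * prompt_len       # mask the prompt tokens
labels = [-100 if t == pad_id else l            # mask the padding tokens
          for t, l in zip(input_ids, labels)]

print(labels)  # [-100, -100, -100, -100, 42, 43, 44, -100, -100]
```

With `mask_prompt=False` (experiment 1b), only the padding mask would be applied and the prompt tokens would keep their ids, so the loss covers the full conversation.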

Experiment Descriptions & HypothesesΒΆ

  • ➑️ Experiment 1a: Completion-Only Loss (Primary)

    • Description: LoRA on the LLM only, with loss calculated just on the assistant's answer.
    • Hypothesis: This will be the most effective method, as the model's learning is focused purely on the task of generating correct bounding box strings.
  • ➑️ Experiment 1b: Full-Text Loss (Sanity Check)

    • Description: LoRA on the LLM only, but the loss is calculated over the entire conversation, including the prompt.
    • Hypothesis: This will perform worse than 1a, as the model will waste capacity learning to predict the prompt it was already given.
  • ➑️ Experiment 2: Vision + Language LoRA (Advanced)

    • Description: LoRA adapters are applied to both the vision encoder and the language model.
    • Hypothesis: This may offer a slight improvement if the nutrition labels have distinct visual features not well-represented in the model's original pre-training data.

Training Configuration (SFTTrainer)ΒΆ

The SFTConfig is set up to balance performance and memory constraints on the A100 40GB GPU. Key choices include:

  • gradient_accumulation_steps: This allows a larger effective batch size for more stable gradients without increasing VRAM.
  • bf16=True: Enables automatic mixed-precision training, which speeds up computation significantly on modern GPUs.
  • gradient_checkpointing=True: A memory-saving technique that trades some computation time to reduce VRAM needed for storing activations.

🎯 LoRA Target Modules: LLM vs Vision Encoder (Qwen2-VL)¢

βœ… Language Model (LLM) LayersΒΆ

  • PEFT automatically matches all layers when you use simple strings like:
    target_modules=["q_proj", "v_proj"]
    
  • Matches:
    model.model.layers.0.self_attn.q_proj β†’ ...layers.27.self_attn.v_proj

πŸ’‘ Why these?
Research and practice show that q_proj and v_proj are typically the most impactful attention projections for LoRA in transformer blocks; tuning just these two captures most of the achievable gain with minimal parameter overhead.

πŸ–ΌοΈ Vision Encoder LayersΒΆ

  • Naming is different:
    model.visual.blocks.0.attn.qkv β†’ ...blocks.31.attn.qkv
  • Use regex to avoid accidental matches:
    r"visual\.blocks\.\d+\.attn\.qkv"
    
  • ⚠️ Avoid just "qkv" β€” too generic, may match unintended modules later.
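The matching rules above can be checked without loading the model. In current PEFT releases, a list of target_modules is suffix-matched as plain strings, while a single regex string is applied with re.fullmatch; the sketch below simulates the suffix rule (`suffix_match` is my stand-in, not PEFT's actual code):

```python
import re

module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "visual.blocks.0.attn.qkv",
    "visual.merger.mlp.0",
]

def suffix_match(name, targets):
    """Approximates how PEFT treats a LIST of target_modules: suffix matching."""
    return any(name == t or name.endswith("." + t) for t in targets)

# Plain strings hit the LLM attention projections...
assert suffix_match("model.layers.0.self_attn.q_proj", ["q_proj", "v_proj"])

# ...while the bare suffix "qkv" would also hit every vision block:
assert suffix_match("visual.blocks.0.attn.qkv", ["qkv"])

# A single regex string is matched with re.fullmatch, so it can be precise:
pattern = r"visual\.blocks\.\d+\.attn\.qkv"
hits = [n for n in module_names if re.fullmatch(pattern, n)]
print(hits)  # ['visual.blocks.0.attn.qkv']
```

One caveat this exposes: a regex placed *inside* a list is suffix-matched literally, so it may match nothing; to rely on regex semantics, pass target_modules as a single string.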
InΒ [21]:
# ============================================================
# πŸ”§ CRITICAL FIX #2: Reduce LoRA configuration for memory efficiency
# ============================================================
# Original config had:
# - r=16 (rank 16)
# - lora_alpha=32
# - 7 target modules: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"]
#
# This consumed ~700 MB - 1 GB for LoRA adapters alone!
#
# New config (matching N's working notebook):
# - r=8 (rank 8) β†’ 4x fewer parameters per adapter
# - lora_alpha=16 (proportional to r)
# - 2 target modules: ["q_proj", "v_proj"] β†’ 3.5x fewer modules
#
# Memory impact:
# - Before: ~700 MB for LoRA + ~360 MB gradients = ~1.06 GB
# - After:  ~200 MB for LoRA + ~100 MB gradients = ~0.30 GB
# - Savings: ~760 MB!

from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    # NOTE: when target_modules is a LIST, PEFT suffix-matches each entry as a
    # plain string; a regex inside a list is treated literally and may match
    # nothing. To target the vision blocks by pattern, pass target_modules as a
    # single regex string instead.
    target_modules=[
        "q_proj",
        "v_proj",
        r"visual\.blocks\.\d+\.attn\.qkv"  # ← vision encoder attention, exp2
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
# model = get_peft_model(model, peft_config)
# model.print_trainable_parameters()
InΒ [22]:
# ----------------------------------------------------------------------------------
# Training Configuration (`SFTConfig`)
# ----------------------------------------------------------------------------------
# The configuration below is optimized for a single A100 40GB GPU and implements
# an early stopping strategy by saving the model at each epoch and loading the
# best one at the end, based on the validation set's Mean IoU.
# ----------------------------------------------------------------------------------
# Memory impact of gradient checkpointing:
# - Without: ~9 GB for activations
# - With:    ~0.6-1.0 GB for activations
# - Savings: ~8 GB!
#
# Trade-off: ~20% slower training, but makes training POSSIBLE!


# EXPERIMENT_NAME = 'exp1a'
# EXPERIMENT_NAME = 'exp1b'
EXPERIMENT_NAME = 'exp2'
exp_tag = EXPERIMENT_NAME


sft_config = SFTConfig(
    output_dir=f"qwen2-7b-nutrition-a100_{exp_tag}",
    num_train_epochs=7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=True,
    tf32=True,
    optim="paged_adamw_8bit",
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=0.3,
    save_strategy="epoch",
    load_best_model_at_end=False,      # set to False for now
    logging_steps=10,
    report_to="none",
    dataset_kwargs={"skip_prepare_dataset": True},
    remove_unused_columns=False,
)

# === Manual Evaluation Strategy ===
# We disable automatic evaluation to prevent OOM errors and will
# evaluate all saved checkpoints manually after training.
sft_config.eval_strategy = "no" #"epoch"
sft_config.load_best_model_at_end = False # having issues with in loop eval text generation control
# sft_config.metric_for_best_model = "eval_mean_gt_iou"
# sft_config.greater_is_better = True
sft_config.generation_max_length = 128

print("βœ… SFTConfig created and optimized for single A100 with early stopping.")
print(f"   Max epochs: {sft_config.num_train_epochs}")
print(f"   Best model will be selected based on: {sft_config.metric_for_best_model}")
βœ… SFTConfig created and optimized for single A100 with early stopping.
   Max epochs: 7
   Best model will be selected based on: None
InΒ [23]:
# MEMORY CHECK CELL

if 'model' not in globals():
    raise RuntimeError('Load the model before running this diagnostics cell.')

try:
    collator = vlm_collator
except NameError:
    collator = VLMDataCollator(processor)
    vlm_collator = collator

if 'batch_debug' not in locals():
    sample = train_dataset[0]
    batch_debug = collator([sample])

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
bytes_per_param = 2  # assume bfloat16 params/checkpoints
param_mem_gb = total_params * bytes_per_param / 1024**3
trainable_mem_gb = trainable_params * bytes_per_param / 1024**3

seq_len = batch_debug['input_ids'].shape[-1]
hidden_size = model.config.text_config.hidden_size
bytes_per_activation = 2  # bfloat16 activations
activation_mem_gb = (seq_len * hidden_size * bytes_per_activation *
                     sft_config.per_device_train_batch_size) / 1024**3

free_mem, total_mem = torch.cuda.mem_get_info()
free_mem_gb, total_mem_gb = free_mem / 1024**3, total_mem / 1024**3

print(f'Total params: {total_params:,} (~{param_mem_gb:.2f} GB)')
print(f'Trainable params: {trainable_params:,} (~{trainable_mem_gb:.2f} GB)')
print(f'Sequence length (debug batch): {seq_len}')
print(f'Hidden size: {hidden_size}')
print(f'Per-microbatch activation estimate: ~{activation_mem_gb:.2f} GB')
print(f'Gradient accumulation steps: {sft_config.gradient_accumulation_steps}')
print(f'Effective batch size: {sft_config.gradient_accumulation_steps * sft_config.per_device_train_batch_size}')
print(f'CUDA memory (free/total): {free_mem_gb:.2f} / {total_mem_gb:.2f} GB')
Total params: 4,691,876,352 (~8.74 GB)
Trainable params: 1,091,870,720 (~2.03 GB)
Sequence length (debug batch): 1136
Hidden size: 3584
Per-microbatch activation estimate: ~0.01 GB
Gradient accumulation steps: 4
Effective batch size: 4
CUDA memory (free/total): 40.50 / 79.25 GB
InΒ [24]:
mask_prompt = EXPERIMENT_NAME != 'exp1b'  # True for exp1a and exp2 (completion-only loss)
vlm_collator = VLMDataCollator(processor, mask_prompt=mask_prompt)
print(f'βœ… Collator ready for {EXPERIMENT_NAME} (mask_prompt={mask_prompt})')
βœ… Collator ready for exp2 (mask_prompt=True)
InΒ [26]:
# can be used in training loop eval
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Decode predictions
    decoded_preds = processor.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 with pad token id in a copy of labels, then decode
    labels_copy = labels.copy()
    labels_copy[labels_copy == -100] = processor.tokenizer.pad_token_id
    decoded_labels = processor.batch_decode(labels_copy, skip_special_tokens=True)

    total_iou = 0.0
    tp = fp = fn = 0
    total_gt = 0
    iou_threshold = 0.5

    for pred_text, label_text in zip(decoded_preds, decoded_labels):
        pred_boxes = parse_bounding_boxes(pred_text)  # [x_min, y_min, x_max, y_max]
        gt_boxes = parse_bounding_boxes(label_text)   # same format now
        if not gt_boxes and not pred_boxes:
            continue
        if not pred_boxes:
            fn += len(gt_boxes)
            total_gt += len(gt_boxes)
            continue
        if not gt_boxes:
            fp += len(pred_boxes)
            continue

        pred_tensor = torch.tensor(pred_boxes, dtype=torch.float32)
        gt_tensor = torch.tensor(gt_boxes, dtype=torch.float32)

        iou_matrix = box_iou(pred_tensor, gt_tensor)
        if iou_matrix.numel() == 0:
            fn += len(gt_boxes)
            fp += len(pred_boxes)
            total_gt += len(gt_boxes)
            continue

        # greedy match
        all_pairs = [
            (iou_matrix[p, g].item(), p, g)
            for p in range(iou_matrix.shape[0])
            for g in range(iou_matrix.shape[1])
        ]
        all_pairs.sort(reverse=True)

        matched_preds = set()
        matched_gts = set()
        matched_iou_sum = 0.0
        for iou, p, g in all_pairs:
            if iou < iou_threshold:
                break
            if p in matched_preds or g in matched_gts:
                continue
            matched_preds.add(p)
            matched_gts.add(g)
            matched_iou_sum += iou

        tp += len(matched_preds)
        fp += len(pred_boxes) - len(matched_preds)
        fn += len(gt_boxes) - len(matched_preds)

        total_iou += matched_iou_sum
        total_gt += len(gt_boxes)

    mean_iou = total_iou / total_gt if total_gt else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "mean_gt_iou": mean_iou,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
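As a sanity check, the greedy one-to-one matching logic above can be exercised on toy boxes. This is a pure-Python sketch (the hand-rolled `iou` helper stands in for the torchvision `box_iou` used in the cell; the box values are made up):

```python
# Toy check of the greedy matching used in compute_metrics (pure Python, no torchvision).
def iou(a, b):
    # a, b: [x_min, y_min, x_max, y_max]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x2 <= x1 or y2 <= y1:
        return 0.0
    inter = (x2 - x1) * (y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_match(pred_boxes, gt_boxes, iou_threshold=0.5):
    # Sort all (pred, gt) pairs by IoU descending, then take each pair at most once.
    pairs = sorted(
        ((iou(p, g), pi, gi)
         for pi, p in enumerate(pred_boxes)
         for gi, g in enumerate(gt_boxes)),
        reverse=True,
    )
    matched_p, matched_g, iou_sum = set(), set(), 0.0
    for score, pi, gi in pairs:
        if score < iou_threshold:
            break
        if pi in matched_p or gi in matched_g:
            continue
        matched_p.add(pi)
        matched_g.add(gi)
        iou_sum += score
    tp = len(matched_p)
    return tp, len(pred_boxes) - tp, len(gt_boxes) - tp, iou_sum

preds = [[0, 0, 10, 10], [20, 20, 30, 30]]
gts = [[1, 1, 10, 10], [50, 50, 60, 60]]
tp, fp, fn, iou_sum = greedy_match(preds, gts)
# One pred overlaps one GT (IoU = 81/100 = 0.81); the other pred and GT stay unmatched,
# so tp=1, fp=1, fn=1, and mean IoU over all GT boxes = 0.81 / 2 = 0.405.
```

Note how the unmatched GT box still counts in the denominator, which is exactly why mean IoU is divided by `total_gt` rather than by the number of matches.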
InΒ [18]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=vlm_collator,
    peft_config=peft_config,
    compute_metrics=compute_metrics,
)

trainer.model.print_trainable_parameters()  # just to confirm LoRA is live
train_output = trainer.train()
# print(train_output)

Evaluation Setup: SDPA Attention ImplementationΒΆ

For all evaluations (baseline and fine-tuned checkpoints), I used SDPA (Scaled Dot Product Attention), PyTorch's native attention implementation:

model = Qwen2VLForConditionalGeneration.from_pretrained(
  "Qwen/Qwen2-VL-7B-Instruct",
  quantization_config=bnb_config,
  device_map="auto",
  attn_implementation="sdpa",  # Use PyTorch SDPA
)

Why SDPA instead of Flash Attention?
- Compatibility: Works reliably with 4-bit quantization + bfloat16
- Stability: No kernel fallback issues during inference
- Consistency: Same attention mechanism across all evaluations (baseline + experiments)
- Sufficient Performance: Evaluation is not bottlenecked by attention (model loading takes longer)

Training vs Evaluation:
- Training: Used default attention (Flash Attention when available) for maximum memory efficiency
- Evaluation: Explicitly specified SDPA for consistent, stable inference

This ensures apples-to-apples comparison across all checkpoints and the baseline model.
InΒ [22]:
def downsize_images(sample):
  """Only resize images, keep everything else intact"""
  max_long_side = 1024
  img = sample["image"].copy()
  img.thumbnail((max_long_side, max_long_side), Image.Resampling.LANCZOS)
  sample["image"] = img
  return sample

# Apply downsizing to RAW dataset (this keeps "objects" field)
dataset_test_downsized = [downsize_images(sample) for sample in dataset_test_raw]

Checkpoint EvaluationΒΆ

InΒ [49]:
# ============================================================================
# CHECKPOINT EVALUATION - Find Best Model Using evaluate_vlm
# ============================================================================
"""
This cell evaluates all training checkpoints to find the best performing model.

WHY evaluate_vlm():
- Ensures ALL ground truth boxes are counted (matched or not)
- Unmatched GT boxes contribute 0 to IoU (included in denominator)

Example: If image has 3 GT boxes but model predicts 1:
- 1 matched box contributes its IoU (e.g., 0.8)
- 2 unmatched boxes contribute 0.0
- mean_iou = 0.8 / 3 = 0.267
"""

EXPERIMENT_NAME = 'exp2'  # CHANGE THIS: 'exp1a', 'exp1b', or 'exp2'
output_dir = f"qwen2-7b-nutrition-a100_{EXPERIMENT_NAME}"

print("="*80)
print(f"πŸ” Evaluating {EXPERIMENT_NAME} checkpoints with evaluate_vlm")
print("="*80)


# ============================================================================
# Load processor (shared across all checkpoints)
# ============================================================================
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Set MAX_PIXELS
vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"βœ… MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels")

torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)


# ============================================================================
# Step 1: Find all checkpoint directories
# ============================================================================
all_items = os.listdir(output_dir)

def extract_checkpoint_number(checkpoint_name):
  """
  Extract step number from checkpoint name.
  
  Args:
      checkpoint_name: String like 'checkpoint-271'
  
  Returns:
      int: Step number (271) or None if not a valid checkpoint
  """
  try:
      return int(checkpoint_name.split('-')[1])
  except (IndexError, ValueError):
      return None

# Filter only valid checkpoints and sort numerically
valid_checkpoints = [d for d in all_items if extract_checkpoint_number(d) is not None]
checkpoints = sorted(valid_checkpoints, key=extract_checkpoint_number)

print(f"\nπŸ“¦ Found {len(checkpoints)} checkpoints to evaluate")
print(f"   Range: {checkpoints[0]} to {checkpoints[-1]}")

# ============================================================================
# Step 2: Evaluate each checkpoint
# ============================================================================
checkpoint_results = []

for i, checkpoint in enumerate(checkpoints, 1):
  checkpoint_path = os.path.join(output_dir, checkpoint)
  step = extract_checkpoint_number(checkpoint)

  print(f"\n[{i}/{len(checkpoints)}] Evaluating {checkpoint} (step {step})...")

    
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  # Load the base model with the same 4-bit quantization as the baseline
  # (apples-to-apples comparison)
  base_model = Qwen2VLForConditionalGeneration.from_pretrained(
      "Qwen/Qwen2-VL-7B-Instruct",
      torch_dtype=torch.bfloat16,
      quantization_config=bnb_config,
      device_map="auto",
      attn_implementation="sdpa",
  )

  # Load LoRA adapter weights
  model = PeftModel.from_pretrained(base_model, checkpoint_path)

  # Evaluate with evaluate_vlm (consistent with baseline)
  metrics = evaluate_vlm(
      model,
      processor,
      dataset_test_downsized,  # Same test set as experiments
      max_samples=None,         # Evaluate all 123 samples
      iou_threshold=0.5         # Standard threshold for detection
  )

  # Store results
  checkpoint_results.append({
      'checkpoint': checkpoint,
      'checkpoint_step': step,
      'mean_gt_iou': metrics['mean_gt_iou'],      # Mean IoU over ALL GT boxes
      'precision@0.5': metrics['precision@0.50'],  # TP / (TP + FP)
      'recall@0.5': metrics['recall@0.50'],        # TP / (TP + FN)
      'f1@0.5': metrics['f1@0.50'],                # Harmonic mean
  })

  print(f"   Mean GT IoU: {metrics['mean_gt_iou']:.3f}")
  print(f"   Precision:   {metrics['precision@0.50']:.3f}")
  print(f"   Recall:      {metrics['recall@0.50']:.3f}")
  print(f"   F1 Score:    {metrics['f1@0.50']:.3f}")

  # Clean up GPU memory
  del model
  del base_model
  torch.cuda.empty_cache()

# ============================================================================
# Step 3: Find best checkpoint and save results
# ============================================================================
df = pd.DataFrame(checkpoint_results)
df = df.sort_values('checkpoint_step')

# Save detailed results
results_path = os.path.join(output_dir, f'{EXPERIMENT_NAME}_checkpoint_results.csv')
df.to_csv(results_path, index=False)
print(f"\nπŸ’Ύ Saved results to: {results_path}")

# Find best checkpoint by mean GT IoU
best_idx = df['mean_gt_iou'].idxmax()
best_checkpoint = df.loc[best_idx, 'checkpoint']
best_iou = df.loc[best_idx, 'mean_gt_iou']
best_f1 = df.loc[best_idx, 'f1@0.5']
best_step = df.loc[best_idx, 'checkpoint_step']

print("\n" + "="*80)
print(f"πŸ† BEST CHECKPOINT: {best_checkpoint}")
print("="*80)
print(f"   Step:        {best_step}")
print(f"   Mean GT IoU: {best_iou:.3f}")
print(f"   F1 Score:    {best_f1:.3f}")
print("="*80)

# Display all results in compact format
print(f"\nπŸ“Š All Checkpoint Results:")
print(df[['checkpoint_step', 'mean_gt_iou', 'f1@0.5']].to_string(index=False))

print(f"\nβœ… Checkpoint evaluation complete for {EXPERIMENT_NAME}")
================================================================================
πŸ” Evaluating exp2 checkpoints with evaluate_vlm
================================================================================
βœ… MAX_PIXELS set to: 470,400 pixels

πŸ“¦ Found 7 checkpoints to evaluate
   Range: checkpoint-271 to checkpoint-1897
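The numeric sort key in `extract_checkpoint_number` matters: a plain string sort orders checkpoint directories lexicographically, not by training step. A quick illustration with made-up names:

```python
# Lexicographic vs numeric ordering of checkpoint directory names.
names = ["checkpoint-1000", "checkpoint-271", "checkpoint-542"]

lexicographic = sorted(names)
numeric = sorted(names, key=lambda n: int(n.split("-")[1]))

print(lexicographic)  # ['checkpoint-1000', 'checkpoint-271', 'checkpoint-542']
print(numeric)        # ['checkpoint-271', 'checkpoint-542', 'checkpoint-1000']
```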
InΒ [27]:
def calculate_iou(pred_boxes, gt_boxes):
  """
  Calculate mean IoU between predicted and ground truth boxes
  
  Args:
      pred_boxes: List of [x_min, y_min, x_max, y_max] (normalized, corner format)
      gt_boxes: List of [y_min, x_min, y_max, x_max] (normalized, corner format)
  """
  if not pred_boxes or not gt_boxes:
      return 0.0

  ious = []
  for gt_box in gt_boxes:
          # GT format: [y_min, x_min, y_max, x_max] — unpack directly (no conversion step needed)
          gt_y_min, gt_x_min, gt_y_max, gt_x_max = gt_box

      best_iou = 0.0
      for pred_box in pred_boxes:
          # Pred format: [x_min, y_min, x_max, y_max]
          pred_x_min, pred_y_min, pred_x_max, pred_y_max = pred_box

          # Calculate intersection (both now in same coordinate system)
          x_left = max(gt_x_min, pred_x_min)
          y_top = max(gt_y_min, pred_y_min)
          x_right = min(gt_x_max, pred_x_max)
          y_bottom = min(gt_y_max, pred_y_max)

          if x_right > x_left and y_bottom > y_top:
              intersection = (x_right - x_left) * (y_bottom - y_top)

              # Calculate areas
              gt_area = (gt_x_max - gt_x_min) * (gt_y_max - gt_y_min)
              pred_area = (pred_x_max - pred_x_min) * (pred_y_max - pred_y_min)

              union = gt_area + pred_area - intersection
              iou = intersection / union if union > 0 else 0.0
              best_iou = max(best_iou, iou)

      ious.append(best_iou)

  return sum(ious) / len(ious) if ious else 0.0
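A worked numeric example of the IoU arithmetic above, using the same mixed box formats (GT as `[y_min, x_min, y_max, x_max]`, prediction as `[x_min, y_min, x_max, y_max]`; the boxes are toy values):

```python
# GT covers the full normalized frame; the prediction covers its left half.
gt = [0.0, 0.0, 1.0, 1.0]     # [y_min, x_min, y_max, x_max]
pred = [0.0, 0.0, 0.5, 1.0]   # [x_min, y_min, x_max, y_max]

gt_y0, gt_x0, gt_y1, gt_x1 = gt
px0, py0, px1, py1 = pred

inter_w = max(min(gt_x1, px1) - max(gt_x0, px0), 0.0)  # 0.5
inter_h = max(min(gt_y1, py1) - max(gt_y0, py0), 0.0)  # 1.0
inter = inter_w * inter_h                               # 0.5
union = ((gt_x1 - gt_x0) * (gt_y1 - gt_y0)
         + (px1 - px0) * (py1 - py0)
         - inter)                                       # 1.0 + 0.5 - 0.5 = 1.0
iou = inter / union
# A half-frame prediction against a full-frame GT gives IoU = 0.5 / 1.0 = 0.5.
```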
InΒ [33]:
def get_sample_ious(model, processor, dataset, max_samples=None):
  """
  Calculate IoU for each sample individually.
  
  This function runs inference on each test sample and calculates the IoU
  between predicted and ground truth boxes. Used for:
  - Distribution analysis
  - Failure case identification
  - Individual sample visualization
  
  Args:
      model: Fine-tuned or baseline model
      processor: AutoProcessor for the model
      dataset: Test dataset (downsized)
      max_samples: Optional limit on samples to process
  
  Returns:
      DataFrame with columns: sample_idx, image_id, iou, prediction, pred_boxes, gt_boxes
  """
  sample_results = []
  samples = dataset[:max_samples] if max_samples else dataset

  for idx, example in enumerate(samples):
      response = run_inference(example, model=model, processor=processor)
      pred_boxes = parse_bounding_boxes(response)
      gt_boxes = example["objects"]["bbox"]
      iou = calculate_iou(pred_boxes, gt_boxes)

      sample_results.append({
          'sample_idx': idx,
          'image_id': example.get('image_id', f'sample_{idx}'),
          'iou': iou,
          'prediction': response,
          'pred_boxes': pred_boxes,
          'gt_boxes': gt_boxes
      })

      if (idx + 1) % 20 == 0:
          print(f"   Processed {idx + 1}/{len(samples)} samples...")

  return pd.DataFrame(sample_results)
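The per-sample DataFrame returned here is later sliced into quartiles (worst / median / best) for visualization. A minimal sketch of that stratified pick on toy IoU values:

```python
# Stratified sampling sketch: worst from the bottom quartile, random from the
# middle two quartiles, best from the top quartile (toy data, 12 samples).
import pandas as pd

df = pd.DataFrame({"sample_idx": range(12), "iou": [i / 11 for i in range(12)]})
df_sorted = df.sort_values("iou")
n = len(df_sorted)

worst = df_sorted.iloc[: n // 4].nsmallest(2, "iou")["sample_idx"].tolist()
middle = df_sorted.iloc[n // 4 : 3 * n // 4].sample(2, random_state=0)["sample_idx"].tolist()
best = df_sorted.iloc[3 * n // 4 :].nlargest(2, "iou")["sample_idx"].tolist()

print(worst, middle, best)
```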
InΒ [48]:
# ============================================================
# COMPLETE EXPERIMENT ANALYSIS - ALL IN ONE CELL
# ⚠️ IMPORTANT: Change EXPERIMENT_NAME for each run!
# ============================================================

# βœ… SET THIS - Change for each experiment: 'exp1a', 'exp1b', 'exp2'
# EXPERIMENT_NAME = 'exp1b'
EXPERIMENT_NAME = 'exp2'

output_dir = f"qwen2-7b-nutrition-a100_{EXPERIMENT_NAME}"
base_model_id = 'Qwen/Qwen2-VL-7B-Instruct'

print(f"\n{'='*70}")
print(f"ANALYZING EXPERIMENT: {EXPERIMENT_NAME}")
print(f"{'='*70}\n")


# ============================================================
# SETUP
# ============================================================

# Set MAX_PIXELS
vision_process.MAX_PIXELS = 600 * 28 * 28
print(f"βœ… MAX_PIXELS set to: {vision_process.MAX_PIXELS:,} pixels")

# Load processor
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

# Force the math SDP backend (flash / mem-efficient kernels off) for stable, reproducible inference
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

# ============================================================
# STEP 1: Load checkpoint evaluation results
# ============================================================

results_path = os.path.join(output_dir, f'{EXPERIMENT_NAME}_checkpoint_results.csv')

# Check if results exist
if not os.path.exists(results_path):
  print(f"❌ Results file not found: {results_path}")
  print(f"   Run checkpoint evaluation first!")
  raise FileNotFoundError(results_path)

df = pd.read_csv(results_path)
df['checkpoint_step'] = df['checkpoint'].str.extract(r'(\d+)').astype(int)
df_sorted = df.sort_values('checkpoint_step')


best_checkpoint = df.loc[df["mean_gt_iou"].idxmax(), "checkpoint"]
best_iou = df.loc[df['mean_gt_iou'].idxmax(), 'mean_gt_iou']

print(f"βœ… Loaded results from: {results_path}")
print(f"βœ… Best checkpoint: {best_checkpoint} (IoU: {best_iou:.4f})")

# ============================================================
# STEP 2: Load training history
# ============================================================

trainer_state_path = os.path.join(output_dir, best_checkpoint, "trainer_state.json")

if not os.path.exists(trainer_state_path):
  print(f"❌ Training state not found: {trainer_state_path}")
  raise FileNotFoundError(trainer_state_path)

with open(trainer_state_path) as f:
  trainer_state = json.load(f)

history = pd.DataFrame(trainer_state["log_history"])
train_loss = history.loc[history["loss"].notna(), ["step", "loss"]].copy()
train_loss["epoch"] = train_loss["step"] / 271  # 271 optimizer steps per epoch in this run

print(f"βœ… Loaded training history")

# ============================================================
# STEP 3: Plot training progress
# ============================================================

print(f"\n{'='*70}")
print(f"Generating training progress plot...")
print(f"{'='*70}\n")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle(f'{EXPERIMENT_NAME}: Training Progress', fontsize=16)

# Training loss
_= ax1.plot(train_loss["epoch"], train_loss["loss"], linewidth=2, color='#2E86AB')
_= ax1.set_xlabel('Epoch', fontsize=12)
_= ax1.set_ylabel('Training Loss (Cross Entropy)', fontsize=12)
_= ax1.set_title('Training Loss', fontsize=14)
_= ax1.grid(True, alpha=0.3)

# Validation metrics
ax2_twin = ax2.twinx()
line1 = ax2.plot(df_sorted["checkpoint_step"] / 271, df_sorted["mean_gt_iou"],
               marker='o', linewidth=2, markersize=6, color='#A23B72', label='Mean IoU')
_= ax2.set_xlabel('Epoch', fontsize=12)
_= ax2.set_ylabel('Mean IoU', fontsize=12, color='#A23B72')
_= ax2.tick_params(axis='y', labelcolor='#A23B72')

# print(df_sorted)

line2 = ax2_twin.plot(df_sorted["checkpoint_step"] / 271, df_sorted["f1@0.5"],
                    marker='s', linewidth=2, markersize=6, color='#F18F01', label='F1 Score')
_= ax2_twin.set_ylabel('F1 Score', fontsize=12, color='#F18F01')
_= ax2_twin.tick_params(axis='y', labelcolor='#F18F01')

_= ax2.set_title('Validation Metrics', fontsize=14)
_= ax2.grid(True, alpha=0.3)

lines = line1 + line2
labels = [l.get_label() for l in lines]
ax2.legend(lines, labels, loc='lower right', fontsize=10)

plt.tight_layout()
plot_path = os.path.join(output_dir, f'training_validation_{EXPERIMENT_NAME}.png')
plt.savefig(plot_path, dpi=150, bbox_inches='tight')
print(f"βœ… Saved: {plot_path}")
plt.show()

# ============================================================
# STEP 4: Load best model for sample analysis
# ============================================================

print(f"\n{'='*70}")
print(f"Loading best model: {best_checkpoint}")
print(f"{'='*70}\n")

bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16,
  bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForImageTextToText.from_pretrained(
  base_model_id,
  quantization_config=bnb_config,
  device_map="auto",
  attn_implementation="sdpa",  # same SDPA path used for all evaluations
  trust_remote_code=True,
)

best_ckpt_path = os.path.join(output_dir, best_checkpoint)
best_model = PeftModel.from_pretrained(base_model, best_ckpt_path, is_trainable=False)
_ = best_model.eval()

print(f"βœ… Model loaded")

# ============================================================
# STEP 5: Sample IoU analysis
# ============================================================

print(f"\n{'='*70}")
print(f"Analyzing sample-level IoUs...")
print(f"{'='*70}\n")

sample_df = get_sample_ious(best_model, processor, dataset_test_downsized)

# Save sample IoUs
sample_iou_path = os.path.join(output_dir, f'sample_ious_{EXPERIMENT_NAME}.csv')
sample_df.to_csv(sample_iou_path, index=False)
print(f"βœ… Saved sample IoUs: {sample_iou_path}")

# Stratified sampling
sample_df_sorted = sample_df.sort_values('iou')

bottom_quartile = sample_df_sorted.iloc[:len(sample_df)//4]
worst_samples = bottom_quartile.nsmallest(3, 'iou')['sample_idx'].tolist()

middle_quartiles = sample_df_sorted.iloc[len(sample_df)//4:3*len(sample_df)//4]
median_samples = middle_quartiles.sample(3, random_state=42)['sample_idx'].tolist()

top_quartile = sample_df_sorted.iloc[3*len(sample_df)//4:]
best_samples = top_quartile.nlargest(3, 'iou')['sample_idx'].tolist()

print(f"\nWorst samples (0-25%): {worst_samples}")
print(f"  IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in worst_samples]}")
print(f"\nMedian samples (25-75%): {median_samples}")
print(f"  IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in median_samples]}")
print(f"\nBest samples (75-100%): {best_samples}")
print(f"  IoU: {[sample_df.loc[sample_df['sample_idx'] == i, 'iou'].values[0] for i in best_samples]}")

# ============================================================
# STEP 6: Visualize performance distribution
# ============================================================

print(f"\n{'='*70}")
print(f"Generating performance distribution visualization...")
print(f"{'='*70}\n")

fig, axes = plt.subplots(3, 3, figsize=(18, 18))
fig.suptitle(f'{EXPERIMENT_NAME}: Performance Distribution (Worst β†’ Median β†’ Best)', fontsize=16)

all_samples = worst_samples + median_samples + best_samples
sample_labels = ['Worst'] * 3 + ['Median'] * 3 + ['Best'] * 3

for idx, (sample_idx, label, ax) in enumerate(zip(all_samples, sample_labels, axes.flat)):
  sample = dataset_test_downsized[sample_idx]
  response = run_inference(sample, model=best_model, processor=processor, max_new_tokens=128)

  image = sample["image"].copy()
  draw = ImageDraw.Draw(image)
  w, h = image.size

  # GT (green)
  for y_min, x_min, y_max, x_max in sample["objects"]["bbox"]:
    _= draw.rectangle([(x_min * w, y_min * h), (x_max * w, y_max * h)], outline="lime", width=3);

  # Pred (red)
  pred_boxes = parse_bounding_boxes(response)
  for x_min, y_min, x_max, y_max in pred_boxes:
      draw.rectangle([(x_min * w, y_min * h), (x_max * w, y_max * h)], outline="red", width=3);

  iou = sample_df.loc[sample_df['sample_idx'] == sample_idx, 'iou'].values[0]
  _= ax.imshow(image)
  _= ax.axis('off')
  _= ax.set_title(f'{label} - IoU: {iou:.3f}', fontsize=11);

plt.tight_layout()
viz_path = os.path.join(output_dir, f'failure_analysis_{EXPERIMENT_NAME}.png')
plt.savefig(viz_path, dpi=150, bbox_inches='tight')
print(f"βœ… Saved: {viz_path}")
plt.show()


# ============================================================================
# PART 8: IoU Distribution Analysis (Bimodal Check)
# ============================================================================
print("\n" + "="*80)
print("πŸ“Š PART 8: IoU Distribution Analysis")
print("="*80)

# Plot IoU distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 5));

# Histogram
axes[0].hist(sample_df['iou'], bins=30, edgecolor='black', alpha=0.7);
axes[0].axvline(sample_df['iou'].mean(), color='red', linestyle='--',
              linewidth=2, label=f'Mean: {sample_df["iou"].mean():.3f}');
axes[0].axvline(sample_df['iou'].median(), color='green', linestyle='--',
              linewidth=2, label=f'Median: {sample_df["iou"].median():.3f}');
axes[0].set_xlabel('IoU Score', fontsize=12);
axes[0].set_ylabel('Frequency', fontsize=12);
axes[0].set_title('IoU Distribution - Test Set', fontsize=14, fontweight='bold');
axes[0].legend(fontsize=11);
axes[0].grid(axis='y', alpha=0.3)

# Cumulative distribution
sorted_ious = np.sort(sample_df['iou'])
cumulative = np.arange(1, len(sorted_ious) + 1) / len(sorted_ious) * 100
axes[1].plot(sorted_ious, cumulative, linewidth=2);
axes[1].axhline(80, color='red', linestyle='--', alpha=0.5, label='80th percentile');
axes[1].axhline(50, color='green', linestyle='--', alpha=0.5, label='50th percentile');
axes[1].set_xlabel('IoU Score', fontsize=12);
axes[1].set_ylabel('Cumulative Percentage', fontsize=12);
axes[1].set_title('Cumulative IoU Distribution', fontsize=14, fontweight='bold');
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(output_dir, f'{EXPERIMENT_NAME}_iou_distribution.png'),
          dpi=150, bbox_inches='tight')
plt.show()

# Print statistics
print(f"\nπŸ“ˆ Distribution Statistics:")
print(f"   Mean IoU:   {sample_df['iou'].mean():.3f}")
print(f"   Median IoU: {sample_df['iou'].median():.3f}")
print(f"   Std Dev:    {sample_df['iou'].std():.3f}")
print(f"\n   Min IoU:    {sample_df['iou'].min():.3f}")
print(f"   Max IoU:    {sample_df['iou'].max():.3f}")

# Quartile breakdown
q1 = sample_df['iou'].quantile(0.25)
q2 = sample_df['iou'].quantile(0.50)
q3 = sample_df['iou'].quantile(0.75)

print(f"\n   25th percentile: {q1:.3f}")
print(f"   50th percentile: {q2:.3f}")
print(f"   75th percentile: {q3:.3f}")

# Performance buckets
excellent = (sample_df['iou'] >= 0.8).sum()
good = ((sample_df['iou'] >= 0.6) & (sample_df['iou'] < 0.8)).sum()
poor = ((sample_df['iou'] >= 0.3) & (sample_df['iou'] < 0.6)).sum()
failures = (sample_df['iou'] < 0.3).sum()

total = len(sample_df)
print(f"\n   Performance Buckets:")
print(f"   Excellent (β‰₯0.8): {excellent:3d} ({excellent/total*100:.1f}%)")
print(f"   Good (0.6-0.8):   {good:3d} ({good/total*100:.1f}%)")
print(f"   Poor (0.3-0.6):   {poor:3d} ({poor/total*100:.1f}%)")
print(f"   Failures (<0.3):  {failures:3d} ({failures/total*100:.1f}%)")


# ============================================================================
# PART 9: Save All Predictions as Images
# ============================================================================
print("\n" + "="*80)
print("πŸ’Ύ PART 9: Saving All Predictions as PNGs")
print("="*80)

output_viz_dir = os.path.join(output_dir, 'all_predictions')
os.makedirs(output_viz_dir, exist_ok=True)

print(f"\nSaving {len(sample_df)} prediction visualizations...")


for idx, row in sample_df.iterrows():
  sample = dataset_test_downsized[row['sample_idx']]

  # Use the working visualization approach
  image = sample["image"].copy()
  draw = ImageDraw.Draw(image)
  w, h = image.size

  # Ground truth boxes (normalized [ymin, xmin, ymax, xmax])
  for y_min, x_min, y_max, x_max in sample["objects"]["bbox"]:
      draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="lime",
          width=4,
      );

  # Predicted boxes (from saved prediction text)
  pred_boxes = parse_bounding_boxes(row['prediction'])
  for x_min, y_min, x_max, y_max in pred_boxes:
    _= draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="red",
          width=4,
      );

  # Create matplotlib figure to save with title
  fig, ax = plt.subplots(figsize=(10, 8))
  _= ax.imshow(image);
  _= ax.set_title(f"IoU: {row['iou']:.3f} | Image ID: {row['image_id']}",
               fontsize=14, fontweight='bold');
  _= ax.axis('off');

  # Add legend
  handles = [
      plt.Line2D([0], [0], color='lime', linewidth=3, label='Ground Truth'),
      plt.Line2D([0], [0], color='red', linewidth=3, label='Prediction')
  ]
  _= ax.legend(handles=handles, loc='upper right', fontsize=10);

  # Save
  filename = f"{row['iou']:.3f}_{row['image_id']}.png"
  _=plt.savefig(os.path.join(output_viz_dir, filename),
              bbox_inches='tight', dpi=100);
  plt.close()

  # Don't use plt.show() - it causes hanging!

print(f"βœ… Saved {len(sample_df)} images to: {output_viz_dir}")
print(f"   Files sorted by IoU (worst to best)")

# ============================================================================
# SUMMARY
# ============================================================================
print("\n" + "="*80)
print(f"βœ… {EXPERIMENT_NAME.upper()} ANALYSIS COMPLETE!")
print("="*80)
print(f"\nπŸ“ All outputs saved to: {output_dir}/")
print(f"   • Training plot: training_validation_{EXPERIMENT_NAME}.png")
print(f"   • Failure cases: failure_analysis_{EXPERIMENT_NAME}.png")
print(f"   • IoU distribution: {EXPERIMENT_NAME}_iou_distribution.png")
print(f"   • Sample-level results: sample_ious_{EXPERIMENT_NAME}.csv")
print(f"   β€’ All predictions: all_predictions/ ({len(sample_df)} images)")
print("\n" + "="*80)

# ============================================================
# CLEANUP
# ============================================================

del best_model, base_model
gc.collect()
torch.cuda.empty_cache()

print(f"\n{'='*70}")
print(f"βœ… ANALYSIS COMPLETE FOR {EXPERIMENT_NAME}")
print(f"Files saved:")
print(f"  - {plot_path}")
print(f"  - {viz_path}")
print(f"  - {sample_iou_path}")
print(f"{'='*70}\n")
======================================================================
ANALYZING EXPERIMENT: exp2
======================================================================

βœ… MAX_PIXELS set to: 470,400 pixels
βœ… Loaded results from: qwen2-7b-nutrition-a100_exp2/exp2_checkpoint_results.csv
βœ… Best checkpoint: checkpoint-1626 (IoU: 0.7476)
βœ… Loaded training history

======================================================================
Generating training progress plot...
======================================================================

✅ Saved: qwen2-7b-nutrition-a100_exp2/training_validation_exp2.png
======================================================================
Loading best model: checkpoint-1626
======================================================================

βœ… Model loaded

======================================================================
Analyzing sample-level IoUs...
======================================================================

   Processed 20/123 samples...
   Processed 40/123 samples...
   Processed 60/123 samples...
   Processed 80/123 samples...
   Processed 100/123 samples...
   Processed 120/123 samples...
βœ… Saved sample IoUs: qwen2-7b-nutrition-a100_exp2/sample_ious_exp2.csv

Worst samples (0-25%): [22, 37, 35]
  IoU: [0.0, 0.0324, 0.1780]

Median samples (25-75%): [80, 16, 102]
  IoU: [0.9266, 0.9340, 0.6447]

Best samples (75-100%): [75, 6, 83]
  IoU: [1.0, 0.9958, 0.9843]

======================================================================
Generating performance distribution visualization...
======================================================================

✅ Saved: qwen2-7b-nutrition-a100_exp2/failure_analysis_exp2.png
================================================================================
πŸ“Š PART 8: IoU Distribution Analysis
================================================================================
[IoU distribution figure saved to qwen2-7b-nutrition-a100_exp2/exp2_iou_distribution.png]
πŸ“ˆ Distribution Statistics:
   Mean IoU:   0.772
   Median IoU: 0.864
   Std Dev:    0.224

   Min IoU:    0.000
   Max IoU:    1.000

   25th percentile: 0.649
   50th percentile: 0.864
   75th percentile: 0.941

   Performance Buckets:
   Excellent (β‰₯0.8):  72 (58.5%)
   Good (0.6-0.8):    27 (22.0%)
   Poor (0.3-0.6):    19 (15.4%)
   Failures (<0.3):    5 (4.1%)

================================================================================
πŸ’Ύ PART 9: Saving All Predictions as PNGs
================================================================================

Saving 123 prediction visualizations...
βœ… Saved 123 images to: qwen2-7b-nutrition-a100_exp2/all_predictions
   Files sorted by IoU (worst to best)

================================================================================
βœ… EXP2 ANALYSIS COMPLETE!
================================================================================

πŸ“ All outputs saved to: qwen2-7b-nutrition-a100_exp2/
   β€’ Training plot: exp2_training_plot.png
   β€’ Failure cases: exp2_failure_cases.png
   β€’ IoU distribution: exp2_iou_distribution.png
   β€’ Sample-level results: exp2_sample_results.csv
   β€’ All predictions: all_predictions/ (123 images)

================================================================================
======================================================================
βœ… ANALYSIS COMPLETE FOR exp2
Files saved:
  - qwen2-7b-nutrition-a100_exp2/training_validation_exp2.png
  - qwen2-7b-nutrition-a100_exp2/failure_analysis_exp2.png
  - qwen2-7b-nutrition-a100_exp2/sample_ious_exp2.csv
======================================================================

Final Results and AnalysisΒΆ

After completing all training runs with a consistent evaluation methodology (4-bit quantization, true Mean IoU, and SDPA attention), I ran a final evaluation on the test set to determine the most effective fine-tuning strategy.

Quantitative ComparisonΒΆ

| Experiment | Mean IoU | F1@0.5 | Precision@0.5 | Recall@0.5 | Best Checkpoint | Epoch |
|---|---|---|---|---|---|---|
| Baseline (Zero-Shot) | 0.590 | 0.654 | 0.661 | 0.646 | - | - |
| Exp 1a: LLM LoRA + Prompt Masking | 0.771 ⭐ | 0.893 | 0.919 | 0.869 | checkpoint-1626 | 6 |
| Exp 1b: LLM LoRA (No Masking) | 0.745 | 0.870 | 0.894 | 0.846 | checkpoint-1355 | 5 |
| Exp 2: Vision+LLM LoRA + Masking | 0.748 | 0.863 | 0.880 | 0.846 | checkpoint-1626 | 6 |

Key Findings and Technical InsightsΒΆ

1. LoRA Fine-Tuning is Highly Effective βœ…

My primary experiment (Exp 1a) was the clear winner, achieving a Mean IoU of 0.771. This represents a 30.7% relative improvement over the strong zero-shot baseline of 0.590 and confirms that QLoRA is an extremely effective technique for this task.

2. Prompt Masking Provides a Clear, Efficient Benefit βœ…

As hypothesized, Exp 1a (with masking) outperformed Exp 1b (without masking), improving the Mean IoU by 3.5%. By focusing the loss on only the model's generated response, prompt masking provides a more efficient learning signal, leading to better performance.
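The masking idea can be sketched in a few lines (a simplified illustration, not the exact collator used in training): tokens belonging to the prompt receive the label -100, PyTorch's cross-entropy ignore index, so only response tokens contribute to the loss.

```python
def mask_prompt_labels(input_ids, prompt_len, ignore_index=-100):
    """Copy input ids into labels, masking the prompt portion.

    Tokens with label -100 are skipped by PyTorch's cross-entropy loss,
    so gradients flow only from the model's generated response.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = ignore_index
    return labels

# Example: a 5-token prompt followed by a 3-token response.
ids = [101, 7592, 2088, 2003, 102, 55, 66, 77]
labels = mask_prompt_labels(ids, prompt_len=5)
# labels -> [-100, -100, -100, -100, -100, 55, 66, 77]
```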

3. Vision Encoder Tuning Shows Diminishing Returns ⚠️

Adding LoRA adapters to the vision encoder (Exp 2) did not improve performance over the LLM-only approach. This strongly suggests that the pre-trained Qwen2-VL vision encoder is already highly capable, and for this task, adapting the language model's reasoning is more impactful than adapting the visual feature extraction itself.

4. Critical Technical Finding: Evaluation Consistency is Key πŸ”

Throughout this project, I confirmed several critical factors for accurate and reproducible evaluation:

  • Image Resolution Must Match Training: Evaluating with a different resolution than was used in training caused a 41% performance drop.
  • True Mean IoU: I implemented a corrected Mean IoU calculation where every ground-truth box contributes to the score, providing a more honest evaluation.
  • Quantization Consistency: All evaluations use the same 4-bit quantization as training to ensure a fair, "apples-to-apples" comparison that reflects a realistic deployment scenario.
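The "true Mean IoU" idea from the second bullet can be sketched as follows (illustrative helper names, not the exact evaluation code; boxes are assumed axis-aligned `(x_min, y_min, x_max, y_max)` in normalized coordinates): each ground-truth box receives its best IoU against any prediction, and a missed box contributes 0 instead of being silently dropped.

```python
def box_iou(a, b):
    # a, b: (x_min, y_min, x_max, y_max) in normalized coordinates
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_gt_iou(gt_boxes, pred_boxes):
    # Every ground-truth box contributes: an unmatched GT box scores 0.
    if not gt_boxes:
        return 0.0
    best_per_gt = [
        max((box_iou(gt, p) for p in pred_boxes), default=0.0)
        for gt in gt_boxes
    ]
    return sum(best_per_gt) / len(best_per_gt)

# A missed GT box drags the mean down rather than being ignored:
gts = [(0.0, 0.0, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
preds = [(0.0, 0.0, 0.5, 0.5)]   # perfect match for the first box only
print(mean_gt_iou(gts, preds))   # 0.5: (1.0 + 0.0) / 2
```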

Qualitative AnalysisΒΆ

IoU Distribution Comparison

The fine-tuning dramatically improved performance, shifting the IoU distribution from a broad, uncertain spread to a sharp peak of high-quality predictions.

Training Progression & Failure Analysis

All experiments showed healthy training dynamics. The validation IoU curves demonstrate that the models learned effectively, with Exp 1a showing the most consistent improvement. The failure analysis shows that the fine-tuned model is significantly more precise than the baseline.


Qualitative Analysis of Each ExperimentΒΆ

Baseline vs. Best Model (Exp 1a): The fine-tuned model is significantly more precise, as shown by the tighter IoU distribution around a much higher mean.

  β€’ IoU Distribution Plots: Baseline IoU Distribution; Experiment 1a IoU Distribution

Training Progression: The validation IoU curves show that all experiments learned effectively, with Exp 1a continuing to improve through all 7 epochs.

  β€’ Training & Validation Curves (Exp 1a): Training curves for Experiment 1a
  β€’ Training & Validation Curves (Exp 1b): Training curves for Experiment 1b
  • Training & Validation Curves (Exp 2): Training curves for Experiment 2

AppendixΒΆ

Production Deployment: Merging LoRA AdaptersΒΆ

For production deployment with vLLM/Triton, LoRA adapters must be merged into the base model.

Benefits:

  • βœ… Single model artifact (easier deployment)
  • βœ… Faster inference (no adapter overhead)
  • βœ… Compatible with vLLM (required for vision models)

Use the deployment script:

python deploy_to_vllm.py \
    --adapter_path qwen2-7b-nutrition-a100_exp1a/checkpoint-1626 \
    --output_dir qwen2-7b-nutrition-merged

See deploy_to_vllm.py for implementation details.


Manual merge example (for reference):
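The merge itself can be sketched with the standard PEFT API (`PeftModel.from_pretrained` followed by `merge_and_unload`). This is a minimal sketch that mirrors the CLI call above, not the exact internals of deploy_to_vllm.py; it is not run in this notebook because it requires downloading the full base model.

```python
def merge_lora_adapters(base_id, adapter_path, output_dir):
    """Fold LoRA weights into the base model and save a single artifact."""
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    # Merging into a 4-bit-quantized model is not supported, so load the
    # base weights in bf16 for the merge step.
    base = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_path)  # attach adapters
    merged = model.merge_and_unload()                      # fold into base
    merged.save_pretrained(output_dir, safe_serialization=True)
    AutoProcessor.from_pretrained(base_id).save_pretrained(output_dir)

# Example (mirrors the CLI call above):
# merge_lora_adapters(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     "qwen2-7b-nutrition-a100_exp1a/checkpoint-1626",
#     "qwen2-7b-nutrition-merged",
# )
```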

InΒ [31]:
# ============================================================================
# BASELINE EVALUATION - Quick Analysis with evaluate_vlm
# ============================================================================
"""
Evaluates the baseline Qwen2-VL-7B model (no fine-tuning) on the test set.

EVALUATION SETUP:
- Uses 4-bit quantization (same as fine-tuned experiments)
- SDPA attention (production-ready)
- evaluate_vlm() for apples-to-apples comparison

This ensures fair comparison: baseline and fine-tuned models evaluated
under identical conditions (quantization, attention, test set).
"""

print("="*80)
print("πŸ” BASELINE MODEL EVALUATION")
print("="*80)

baseline_dir = "qwen2-7b-nutrition-baseline"
os.makedirs(baseline_dir, exist_ok=True)

# ============================================================================
# Step 1: Load baseline model (quantized like fine-tuned models during eval)
# ============================================================================
print("\nπŸ“¦ Loading baseline Qwen2-VL-7B model...")

bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16,
)

baseline_model = Qwen2VLForConditionalGeneration.from_pretrained(
  "Qwen/Qwen2-VL-7B-Instruct",
  quantization_config=bnb_config,
  device_map="auto",
  attn_implementation="sdpa",
)
baseline_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

print("βœ… Baseline model loaded")

# ============================================================================
# Step 2: Evaluate with evaluate_vlm
# ============================================================================
print("\nπŸ“Š Running evaluate_vlm on test set...")

baseline_metrics = evaluate_vlm(
  baseline_model,
  baseline_processor,
  dataset_test_downsized,  # Same test set as experiments
  max_samples=None,
  iou_threshold=0.5
)

print(f"\n🎯 Baseline Performance:")
print(f"   Mean GT IoU:  {baseline_metrics['mean_gt_iou']:.3f}")
print(f"   Precision@0.5: {baseline_metrics['precision@0.50']:.3f}")
print(f"   Recall@0.5:    {baseline_metrics['recall@0.50']:.3f}")
print(f"   F1@0.5:        {baseline_metrics['f1@0.50']:.3f}")

# Save metrics

metrics_path = os.path.join(baseline_dir, 'baseline_metrics.json')
with open(metrics_path, 'w') as f:
  json.dump(baseline_metrics, f, indent=2)
print(f"\nπŸ’Ύ Saved metrics to: {metrics_path}")

# ============================================================================
# Step 3: Get per-sample IoUs for distribution analysis
# ============================================================================
print("\nπŸ” Calculating per-sample IoUs...")

sample_df = get_sample_ious(baseline_model, baseline_processor, dataset_test_downsized)

print(f"βœ… Evaluated {len(sample_df)} samples")
print(f"   Mean IoU (per-sample): {sample_df['iou'].mean():.3f}")
print(f"   Median IoU: {sample_df['iou'].median():.3f}")

# Save
sample_path = os.path.join(baseline_dir, 'baseline_sample_results.csv')
sample_df.to_csv(sample_path, index=False)
print(f"   Saved: {sample_path}")

# ============================================================================
# Step 4: Plot IoU distribution
# ============================================================================
print("\nπŸ“Š Creating IoU distribution plot...")

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Histogram
axes[0].hist(sample_df['iou'], bins=30, edgecolor='black', alpha=0.7, color='coral')
axes[0].axvline(sample_df['iou'].mean(), color='red', linestyle='--',
              linewidth=2, label=f'Mean: {sample_df["iou"].mean():.3f}')
axes[0].axvline(sample_df['iou'].median(), color='green', linestyle='--',
              linewidth=2, label=f'Median: {sample_df["iou"].median():.3f}')
axes[0].set_xlabel('IoU Score', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('BASELINE - IoU Distribution', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(axis='y', alpha=0.3)

# Cumulative
sorted_ious = np.sort(sample_df['iou'])
cumulative = np.arange(1, len(sorted_ious) + 1) / len(sorted_ious) * 100
axes[1].plot(sorted_ious, cumulative, linewidth=2, color='coral')
axes[1].axhline(80, color='red', linestyle='--', alpha=0.5, label='80th percentile')
axes[1].axhline(50, color='green', linestyle='--', alpha=0.5, label='50th percentile')
axes[1].set_xlabel('IoU Score', fontsize=12)
axes[1].set_ylabel('Cumulative Percentage', fontsize=12)
axes[1].set_title('BASELINE - Cumulative Distribution', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
dist_path = os.path.join(baseline_dir, 'baseline_iou_distribution.png')
plt.savefig(dist_path, dpi=150, bbox_inches='tight')
plt.show()
print(f"βœ… Saved: {dist_path}")

# Print stats
q1, q2, q3 = sample_df['iou'].quantile([0.25, 0.50, 0.75])
excellent = (sample_df['iou'] >= 0.8).sum()
good = ((sample_df['iou'] >= 0.6) & (sample_df['iou'] < 0.8)).sum()
poor = ((sample_df['iou'] >= 0.3) & (sample_df['iou'] < 0.6)).sum()
failures = (sample_df['iou'] < 0.3).sum()
total = len(sample_df)

print(f"\nπŸ“ˆ Distribution Statistics:")
print(f"   Quartiles: {q1:.3f} / {q2:.3f} / {q3:.3f}")
print(f"\n   Performance Buckets:")
print(f"   Excellent (β‰₯0.8): {excellent:3d} ({excellent/total*100:.1f}%)")
print(f"   Good (0.6-0.8):   {good:3d} ({good/total*100:.1f}%)")
print(f"   Poor (0.3-0.6):   {poor:3d} ({poor/total*100:.1f}%)")
print(f"   Failures (<0.3):  {failures:3d} ({failures/total*100:.1f}%)")

# ============================================================================
# Step 5: Save all predictions as marked-up images
# ============================================================================
print("\nπŸ’Ύ Saving all predictions as marked-up images...")

viz_dir = os.path.join(baseline_dir, 'all_predictions')
os.makedirs(viz_dir, exist_ok=True)

for idx, row in sample_df.iterrows():
  sample = dataset_test_downsized[row['sample_idx']]

  image = sample["image"].copy()
  draw = ImageDraw.Draw(image)
  w, h = image.size

  # Ground truth (green); dataset bboxes are normalized (y_min, x_min, y_max, x_max)
  for y_min, x_min, y_max, x_max in sample["objects"]["bbox"]:
      draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="lime", width=4
      )

  # Predictions (red); model outputs normalized (x_min, y_min, x_max, y_max)
  for x_min, y_min, x_max, y_max in row['pred_boxes']:
      draw.rectangle(
          [(x_min * w, y_min * h), (x_max * w, y_max * h)],
          outline="red", width=4
      )

  # Save
  fig, ax = plt.subplots(figsize=(10, 8))
  ax.imshow(image)
  ax.set_title(f"IoU: {row['iou']:.3f} | Image ID: {row['image_id']}",
               fontsize=14, fontweight='bold')
  ax.axis('off')

  handles = [
      plt.Line2D([0], [0], color='lime', linewidth=3, label='Ground Truth'),
      plt.Line2D([0], [0], color='red', linewidth=3, label='Prediction')
  ]
  ax.legend(handles=handles, loc='upper right', fontsize=10)

  filename = f"{row['iou']:.3f}_{row['image_id']}.png"
  plt.savefig(os.path.join(viz_dir, filename), bbox_inches='tight', dpi=100)
  plt.close()

print(f"βœ… Saved {len(sample_df)} images to: {viz_dir}")

# ============================================================================
# Summary
# ============================================================================
print("\n" + "="*80)
print("βœ… BASELINE EVALUATION COMPLETE!")
print("="*80)
print(f"\nπŸ“ All outputs in: {baseline_dir}/")
print(f"   β€’ baseline_metrics.json")
print(f"   β€’ baseline_iou_distribution.png")
print(f"   β€’ baseline_sample_results.csv")
print(f"   β€’ all_predictions/ ({len(sample_df)} images)")
print("\n" + "="*80)

# Cleanup
del baseline_model
torch.cuda.empty_cache()

print("\n🎯 Ready to compare with fine-tuned experiments!")
================================================================================
πŸ” BASELINE MODEL EVALUATION
================================================================================

πŸ“¦ Loading baseline Qwen2-VL-7B model...
βœ… Baseline model loaded

πŸ“Š Running evaluate_vlm on test set...

🎯 Baseline Performance:
   Mean GT IoU:  0.590
   Precision@0.5: 0.661
   Recall@0.5:    0.646
   F1@0.5:        0.654

πŸ’Ύ Saved metrics to: qwen2-7b-nutrition-baseline/baseline_metrics.json

πŸ” Calculating per-sample IoUs...
   Processed 20/123 samples...
   Processed 40/123 samples...
   Processed 60/123 samples...
   Processed 80/123 samples...
   Processed 100/123 samples...
   Processed 120/123 samples...
βœ… Evaluated 123 samples
   Mean IoU (per-sample): 0.607
   Median IoU: 0.605
   Saved: qwen2-7b-nutrition-baseline/baseline_sample_results.csv

πŸ“Š Creating IoU distribution plot...
[Figure: BASELINE IoU distribution, histogram (left) and cumulative distribution (right)]
βœ… Saved: qwen2-7b-nutrition-baseline/baseline_iou_distribution.png

πŸ“ˆ Distribution Statistics:
   Quartiles: 0.414 / 0.605 / 0.830

   Performance Buckets:
   Excellent (β‰₯0.8):  38 (30.9%)
   Good (0.6-0.8):    26 (21.1%)
   Poor (0.3-0.6):    39 (31.7%)
   Failures (<0.3):   20 (16.3%)

πŸ’Ύ Saving all predictions as marked-up images...
[Per-sample prediction figures (each titled "IoU: <score> | Image ID: <id>") omitted; the rendered images are saved under qwen2-7b-nutrition-baseline/all_predictions/]
Out[31]:
<matplotlib.legend.Legend at 0x70c1b4011d80>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1d446a950>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.591 | Image ID: 0024138012322_3')
Out[31]:
(np.float64(-0.5), np.float64(764.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1d422a500>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1b4012b90>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.546 | Image ID: 0028400071345_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1d424bf10>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1b40e0370>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.239 | Image ID: 0041449003153_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1b4b58af0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac5e1750>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.956 | Image ID: 20720162_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac5275e0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac43a530>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.129 | Image ID: 3700003780349_1')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(824.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac58e110>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac36fc10>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.717 | Image ID: 0037600106252_4')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac5eb730>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac3da800>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.109 | Image ID: 0070970471254_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac36ff70>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac2932b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.160 | Image ID: 01575118_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac3b04f0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac2e7df0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.513 | Image ID: 0016000264694_1')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac22c460>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac0051b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.291 | Image ID: 0021000653218_1')
Out[31]:
(np.float64(-0.5), np.float64(232.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac2d1ed0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac07fb80>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.518 | Image ID: 20840822_1')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac08d360>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abf1d6f0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.844 | Image ID: 0011156054502_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac0b3d30>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abfb57b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.288 | Image ID: 0016000275348_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac0cf8b0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abe7d570>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.579 | Image ID: 0046100001639_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abe07e50>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abee2c80>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.449 | Image ID: 0018627703211_3')
Out[31]:
(np.float64(-0.5), np.float64(249.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abe47790>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abd1ef20>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.759 | Image ID: 0073007107140_1')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abebbd00>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abdacb20>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.798 | Image ID: 0036200013694_2')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abd5a410>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abc076d0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.953 | Image ID: 26191218_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abdae0e0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abc44f40>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.271 | Image ID: 20574369_2')
Out[31]:
(np.float64(-0.5), np.float64(639.5), np.float64(639.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abc4e9b0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abccd8a0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.961 | Image ID: 26167932_4')
Out[31]:
(np.float64(-0.5), np.float64(879.5), np.float64(737.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abcae9b0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abbb2bc0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.827 | Image ID: 0049000027624_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abb0a0e0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aba73dc0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.892 | Image ID: 20574444_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(764.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abb48d30>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abae63b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.448 | Image ID: 0038000787270_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ababfb20>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab9a2800>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.754 | Image ID: 0027400264993_1')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab9099c0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab9e23b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.191 | Image ID: 2407968021654_2')
Out[31]:
(np.float64(-0.5), np.float64(592.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab9838e0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab8dde40>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.624 | Image ID: 0014054030715_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab84a4a0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab764f70>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.154 | Image ID: 50300853_1')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab8df8b0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab617040>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.117 | Image ID: 3250391868322_6')
Out[31]:
(np.float64(-0.5), np.float64(768.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab8feb30>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab6f7ca0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.662 | Image ID: 3410280003832_1')
Out[31]:
(np.float64(-0.5), np.float64(946.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab665270>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab518fa0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.901 | Image ID: 3230140005024_3')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab6a1e40>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab5fcf10>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.796 | Image ID: 26067674_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(575.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab519cc0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab45a200>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.433 | Image ID: 20719159_3')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab5df5e0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aafad030>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.616 | Image ID: 0034000123803_7')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(758.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab459de0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aabc5690>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.571 | Image ID: 3222473615476_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab4c2e30>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aa40e500>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.328 | Image ID: 0067312002832_4')
Out[31]:
(np.float64(-0.5), np.float64(1009.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aaa7aa10>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aa78c0d0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.979 | Image ID: 26015637_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aab16ce0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c20c142800>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.943 | Image ID: 0025616102504_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aa78ceb0>
Out[31]:
<matplotlib.image.AxesImage at 0x70cb00b598a0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.605 | Image ID: 20551926_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c2145e5780>
Out[31]:
<matplotlib.image.AxesImage at 0x70cb0019aad0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.332 | Image ID: 0054800010080_3')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c2142ffb80>
Out[31]:
<matplotlib.image.AxesImage at 0x70cb001f6680>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.791 | Image ID: 20117795_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(706.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70cb00199d50>
Out[31]:
<matplotlib.image.AxesImage at 0x70cab7909de0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.390 | Image ID: 0043647020017_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70cadff94520>
Out[31]:
<matplotlib.image.AxesImage at 0x70c6800fb4f0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.418 | Image ID: 0043646210389_1')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70cab790a680>
Out[31]:
<matplotlib.image.AxesImage at 0x70c6881d6c20>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.928 | Image ID: 26117959_7')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(647.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c6800fad10>
Out[31]:
<matplotlib.image.AxesImage at 0x70c68814f160>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.663 | Image ID: 20165079_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c6801190f0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c6fb3da0e0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.832 | Image ID: 24632621_4')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c68814e830>
Out[31]:
<matplotlib.image.AxesImage at 0x70c6fb324640>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.467 | Image ID: 9300601250240_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c6fb3d80a0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c68814cf70>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.828 | Image ID: 20674540_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c6fb3e9c90>
Out[31]:
<matplotlib.image.AxesImage at 0x70c6801349d0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.577 | Image ID: 0072878515276_2')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c688139ed0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c21466f7f0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.978 | Image ID: 3250390023777_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(575.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c6881b1e10>
Out[31]:
<matplotlib.image.AxesImage at 0x70cadff95bd0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.348 | Image ID: 0037466016450_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c21468fa60>
Out[31]:
<matplotlib.image.AxesImage at 0x70cadff17760>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.951 | Image ID: 3068320112893_9')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(702.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70cb002eaf80>
Out[31]:
<matplotlib.image.AxesImage at 0x70cb024fab60>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.273 | Image ID: 93300292_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(575.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70cadffb0b50>
Out[31]:
<matplotlib.image.AxesImage at 0x70c20c140130>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.892 | Image ID: 2000000033325_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(132.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c2142fc610>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aa504be0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.543 | Image ID: 00854252_6')
Out[31]:
(np.float64(-0.5), np.float64(479.5), np.float64(639.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c21452a7d0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aaa7a5f0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.411 | Image ID: 01642582_1')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aa78e2f0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aafc1ed0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.634 | Image ID: 0014100074120_2')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aabed570>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab456ad0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.606 | Image ID: 0038000316104_4')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1aae07130>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab5dcdf0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.410 | Image ID: 0021000419074_1')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab455690>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab59c910>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.862 | Image ID: 3250390768296_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab5fdfc0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab6f5e10>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.800 | Image ID: 20298302_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab558700>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab665510>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.943 | Image ID: 3284230002240_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab6dcaf0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab7658d0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.950 | Image ID: 26155432_2')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab614d90>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab8de6b0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.019 | Image ID: 3350031653285_1')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab766200>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab9e1d20>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.815 | Image ID: 0072869110138_3')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab8fc1f0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ab983010>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.258 | Image ID: 20520090_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab9b6860>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abaac520>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.785 | Image ID: 20608668_3')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ab9a06d0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1aba4af80>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.638 | Image ID: 0041498000028_3')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abae7f70>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abb49570>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.596 | Image ID: 0052159000073_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abbcb4c0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abb09390>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.648 | Image ID: 0063667090067_4')
Out[31]:
(np.float64(-0.5), np.float64(331.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abbb0310>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abcae410>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.583 | Image ID: 0021130079278_2')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abcccf70>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abc4fee0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.964 | Image ID: 26101989_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(926.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abc71360>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abdaed70>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.499 | Image ID: 0016000106406_2')
Out[31]:
(np.float64(-0.5), np.float64(758.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abc05240>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abd1c160>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.917 | Image ID: 20472313_2')
Out[31]:
(np.float64(-0.5), np.float64(764.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abdaf550>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abeba110>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.915 | Image ID: 0016000442825_4')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(377.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abd0a320>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abe476a0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.733 | Image ID: 20889869_2')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abee34c0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1abf41ed0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.770 | Image ID: 26212630_6')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(750.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abe44df0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac0eae00>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.303 | Image ID: 3596710458455_7')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abf90730>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac08e590>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.960 | Image ID: 22000279_5')
Out[31]:
(np.float64(-0.5), np.float64(996.5), np.float64(888.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1abf22b00>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac14b130>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.505 | Image ID: 8480000342096_3')
Out[31]:
(np.float64(-0.5), np.float64(824.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac07e980>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac2d2980>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.795 | Image ID: 0030000059708_1')
Out[31]:
(np.float64(-0.5), np.float64(575.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac10df30>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac22ed70>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.496 | Image ID: 3257984581972_1')
Out[31]:
(np.float64(-0.5), np.float64(852.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac2d0040>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac3da2c0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.921 | Image ID: 0070177067731_2')
Out[31]:
(np.float64(-0.5), np.float64(578.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac22de40>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac36e200>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.601 | Image ID: 0058449771807_2')
Out[31]:
(np.float64(-0.5), np.float64(613.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac3595d0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac463fd0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.402 | Image ID: 9300601462889_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(575.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac36f370>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1ac5b95a0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.328 | Image ID: 0073141152327_2')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac4600d0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1b40f2c50>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.624 | Image ID: 0071962226104_3')
Out[31]:
(np.float64(-0.5), np.float64(1023.5), np.float64(767.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1ac5e18a0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1b41fc610>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.802 | Image ID: 0039000081047_4')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1b40f3a90>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1d437d690>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.793 | Image ID: 26209142_6')
Out[31]:
(np.float64(-0.5), np.float64(962.5), np.float64(706.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1b41fe650>
Out[31]:
<matplotlib.image.AxesImage at 0x70c1b4011fc0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.820 | Image ID: 20511586_3')
Out[31]:
(np.float64(-0.5), np.float64(764.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1d43681f0>
Out[31]:
<matplotlib.image.AxesImage at 0x70c2142a3fd0>
Out[31]:
Text(0.5, 1.0, 'IoU: 0.548 | Image ID: 0071921377601_1')
Out[31]:
(np.float64(-0.5), np.float64(767.5), np.float64(1023.5), np.float64(-0.5))
Out[31]:
<matplotlib.legend.Legend at 0x70c1b4013970>
✅ Saved 123 images to: qwen2-7b-nutrition-baseline/all_predictions

================================================================================
✅ BASELINE EVALUATION COMPLETE!
================================================================================

📁 All outputs in: qwen2-7b-nutrition-baseline/
   • baseline_metrics.json
   • baseline_iou_distribution.png
   • baseline_sample_results.csv
   • all_predictions/ (123 images)

================================================================================

🎯 Ready to compare with fine-tuned experiments!
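The per-image scores in the plot titles above come from Intersection over Union between the predicted and ground-truth boxes. As a reference (not the notebook's actual evaluation code, whose helper names are not shown here), a minimal sketch of the metric, assuming boxes in `(x1, y1, x2, y2)` pixel coordinates:

```python
def compute_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Perfect overlap -> 1.0; disjoint boxes -> 0.0
print(compute_iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0
print(compute_iou((0, 0, 10, 10), (20, 20, 30, 30))) # 0.0
```

Mean IoU over the 123 evaluation images is simply the average of these per-image values.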
In [ ]: